Analysis & Predictive Modeling
Pertemuan 3
Pendahuluan
Pendahuluan
Regresi logistik multinomial (Multinomial Logistic Regression — MLR) adalah teknik pemodelan yang digunakan ketika variabel target (dependen) bersifat kategorikal dengan lebih dari dua kelas dan kategori-kategori tersebut tidak berurutan (nominal). Berbeda dengan regresi logistik biner yang hanya memisahkan dua kelas, MLR memodelkan probabilitas setiap kelas dengan cara mengekspresikan log-odds relatif terhadap sebuah kelas referensi (baseline).
Pada analisis ini kita menggunakan dataset yang memiliki variabel target bernama Success_Level dengan tiga kategori: Low, Medium, High. Variabel prediktor (independen) yang dipakai adalah:
- Advertising — anggaran iklan (misal unit ribuan atau skala apa pun sesuai dataset),
- Salespeople — jumlah tenaga penjual / sales staff,
- Satisfaction — tingkat kepuasan pelanggan (skor),
- Competition — tingkat kompetisi (skala).
Tujuan analisis:
Menjelaskan konsep MLR secara matematis dan intuitif.
Membangun model MLR menggunakan nnet::multinom.
Menyajikan dua visualisasi interaktif yang membantu interpretasi:
- Heatmap confusion matrix interaktif,
- Plot probabilitas prediksi interaktif (stacked per observasi).
Memberi interpretasi hasil yang jelas dan aplikatif untuk pengambilan keputusan.
Rumus Regresi Logistik Multinomial
Secara umum, persamaan untuk kategori ke-j adalah:
\[ \log\left(\frac{P(Y=j)}{P(Y=\text{base})}\right) = \beta_{0j} + \beta_{1j}X_1 + \beta_{2j}X_2 + ... + \beta_{pj}X_p \] — —
1. Persiapan Paket
library(nnet)
library(dplyr)
library(tidyr)
library(ggplot2)
library(plotly)
library(broom)
library(readr)
library(forcats)
library(caret)
library(tibble)
set.seed(123)2. Pemanggilan dan Pembersihan Data
default_path <- "2 Regression Models - Analysis and Predictive Modeling (1).csv"
if (!file.exists(default_path)) {
csv_path <- file.choose()
} else {
csv_path <- default_path
}
read_safely <- function(path) {
encodings <- c("UTF-8-BOM", "latin1", "")
for (enc in encodings) {
result <- tryCatch(read.csv(path, stringsAsFactors = FALSE, fileEncoding = enc),
error = function(e) e)
if (!inherits(result, "error")) return(result)
}
stop("Gagal membaca CSV. Periksa encoding file.")
}
raw <- read_safely(csv_path)
safe_names <- iconv(names(raw), from = "", to = "ASCII//TRANSLIT", sub = "_")
safe_names <- gsub("[^A-Za-z0-9_]", "_", safe_names)
names(raw) <- safe_names
names(raw)## [1] "X" "Advertising" "Salespeople" "Satisfaction" "Competition"
## [6] "Success"
Penjelasan: Langkah ini memastikan tidak ada error encoding akibat karakter non-ASCII, dan nama kolom disesuaikan agar kompatibel untuk pemrosesan di R.
3. Praproses Data
data <- raw
data <- data %>% select(-matches("^(Unnamed|X$|X[0-9]+$|i__)"), everything())
if ("Success_Level" %in% names(data) && !("Success" %in% names(data))) {
data <- data %>% rename(Success = Success_Level)
}
expected_preds <- c("Advertising", "Salespeople", "Satisfaction", "Competition")
for (v in intersect(expected_preds, names(data))) {
data[[v]] <- suppressWarnings(as.numeric(data[[v]]))
}
data$Success <- as.factor(data$Success)
if ("Low" %in% levels(data$Success)) {
data$Success <- relevel(data$Success, ref = "Low")
}
data_model <- data %>% select(all_of(c("Success", expected_preds))) %>% na.omit()
summary(data_model)## Success Advertising Salespeople Satisfaction Competition
## 0: 16 Min. : 5.016 Min. :10.16 Min. :1.004 Min. :1.011
## 1:184 1st Qu.:11.803 1st Qu.:15.89 1st Qu.:3.462 1st Qu.:3.217
## Median :17.052 Median :21.73 Median :5.555 Median :5.429
## Mean :17.660 Mean :22.23 Mean :5.520 Mean :5.442
## 3rd Qu.:23.334 3rd Qu.:28.55 3rd Qu.:7.813 3rd Qu.:7.797
## Max. :29.857 Max. :34.99 Max. :9.970 Max. :9.954
Penjelasan: Semua prediktor dikonversi ke numerik. Variabel target diset sebagai faktor dengan Low sebagai kategori dasar.
4. Pembagian Data
split_idx <- createDataPartition(data_model$Success, p = 0.8, list = FALSE)
train <- data_model[split_idx, ]
test <- data_model[-split_idx, ]
cat("Train:", nrow(train), "baris | Test:", nrow(test), "baris")## Train: 161 baris | Test: 39 baris
Penjelasan: Dataset dibagi menjadi 80% data latih dan 20% data uji untuk menghindari overfitting.
5. Pembangunan Model
formula_multi <- Success ~ Advertising + Salespeople + Satisfaction + Competition
model_multi <- multinom(formula_multi, data = train, Hess = TRUE, trace = FALSE)
summary(model_multi)## Call:
## multinom(formula = formula_multi, data = train, Hess = TRUE,
## trace = FALSE)
##
## Coefficients:
## Values Std. Err.
## (Intercept) -5.6963955 1.93059767
## Advertising 0.2057820 0.06390454
## Salespeople 0.2338752 0.07448074
## Satisfaction 0.4390046 0.15221471
## Competition -0.2375119 0.15128701
##
## Residual Deviance: 50.97412
## AIC: 60.97412
Penjelasan: Model multinom() digunakan
untuk membangun regresi logistik multinomial dengan Success
sebagai target.
6. Ringkasan Koefisien dan Nilai Signifikansi
sm <- summary(model_multi)
coefs <- sm$coefficients
ses <- sm$standard.errors
z_vals <- coefs / ses
p_vals <- 2 * (1 - pnorm(abs(z_vals)))
coef_table <- as.data.frame(coefs) %>%
rownames_to_column(var = "Response") %>%
pivot_longer(-Response, names_to = "Term", values_to = "Estimate") %>%
left_join(as.data.frame(z_vals) %>% rownames_to_column("Response") %>%
pivot_longer(-Response, names_to="Term", values_to="z"),
by=c("Response","Term")) %>%
left_join(as.data.frame(p_vals) %>% rownames_to_column("Response") %>%
pivot_longer(-Response, names_to="Term", values_to="p_value"),
by=c("Response","Term")) %>%
mutate(OR = exp(Estimate)) %>%
arrange(Response, Term)
coef_table7. Evaluasi Model
pred_class <- predict(model_multi, newdata = test)
pred_prob <- predict(model_multi, newdata = test, type = "probs")
conf_matrix <- table(Predicted = pred_class, Actual = test$Success)
conf_matrix## Actual
## Predicted 0 1
## 0 0 1
## 1 3 35
## Akurasi Model: 89.74 %
8. Visualisasi Hasil
8.1 Confusion Matrix Heatmap
8.2 Stacked Probability Plot
prob_long <- as.data.frame(pred_prob) %>%
mutate(obs = 1:n()) %>%
pivot_longer(-obs, names_to="Category", values_to="Prob")
plot_ly(prob_long, x = ~obs, y = ~Prob, color = ~Category, type = "bar") %>%
layout(barmode = "stack", yaxis = list(title = "Probabilitas"),
title = "Predicted Probabilities for Each Observation")## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
8.3 Odds Ratio Plot
OR <- exp(coefs)
lower <- exp(coefs - 1.96 * ses)
upper <- exp(coefs + 1.96 * ses)
coef_or <- as.data.frame(OR) %>%
rownames_to_column(var = "Response") %>%
pivot_longer(-Response, names_to = "Term", values_to = "OR") %>%
left_join(as.data.frame(lower) %>% rownames_to_column("Response") %>%
pivot_longer(-Response, names_to = "Term", values_to = "Lower"),
by = c("Response","Term")) %>%
left_join(as.data.frame(upper) %>% rownames_to_column("Response") %>%
pivot_longer(-Response, names_to = "Term", values_to = "Upper"),
by = c("Response","Term"))
coef_or <- coef_or %>% filter(Term != "(Intercept)")
p <- ggplot(coef_or, aes(x = OR, y = Term, color = Response,
text = paste0("Response: ", Response,
"<br>Term: ", Term,
"<br>OR: ", round(OR,3),
"<br>95% CI: [", round(Lower,3),
", ", round(Upper,3), "]"))) +
geom_point() + geom_errorbarh(aes(xmin = Lower, xmax = Upper), height = 0.2) +
geom_vline(xintercept = 1, linetype = "dashed") +
xlab("Odds Ratio (exp(coef))") + ylab("Predictor") +
ggtitle("Odds Ratios per Predictor (Multinomial) - vs baseline") +
theme_minimal()
ggplotly(p, tooltip = "text")8.4 Visualisasi 3D Probabilitas Prediksi
pred_prob <- as.matrix(predict(model_multi, newdata = data, type = "probs"))
data$Prob_Low <- pred_prob[, 1]
if (ncol(pred_prob) >= 2) data$Prob_Medium <- pred_prob[, 2]
if (ncol(pred_prob) >= 3) data$Prob_High <- pred_prob[, 3]
plot_ly(
x = data$Advertising,
y = data$Satisfaction,
z = data$Prob_Low,
color = data$Success,
colors = c("red", "green", "blue"),
type = "scatter3d",
mode = "markers"
) %>%
layout(
scene = list(
xaxis = list(title = "Advertising"),
yaxis = list(title = "Satisfaction"),
zaxis = list(title = "Predicted Probability (Low)")
),
title = "3D Visualization of Predicted Probabilities for Class 'Low'"
)9. Ringkasan Visualisasi dan Interpretasi
| No | Visualisasi | Jenis Grafik | Tujuan Analisis | Interpretasi Utama |
|---|---|---|---|---|
| 1 | Confusion Matrix Heatmap | Heatmap interaktif | Menilai akurasi klasifikasi per kategori | Warna diagonal yang semakin gelap menunjukkan prediksi model semakin tepat untuk kelas tersebut. |
| 2 | Stacked Probability Plot | Grafik batang bertumpuk | Menampilkan probabilitas prediksi per observasi | Observasi dengan satu warna dominan = prediksi pasti; campuran = prediksi ambigu. |
| 3 | Odds Ratio Plot | Plot titik dengan CI | Menilai kekuatan pengaruh variabel | Titik kanan garis OR=1 menaikkan peluang; kiri menurunkan peluang. |
| 4 | 3D Probability Visualization | Scatter 3D | Melihat pola spasial probabilitas | Distribusi antar warna menunjukkan separasi antar kelas. |
10. Kesimpulan
- Model multinomial logistic regression dapat memisahkan kelas Success dengan akurasi baik.
- Variabel Advertising dan Satisfaction paling signifikan.
- Visualisasi 2D dan 3D memperlihatkan bagaimana model membedakan kelas berdasarkan pola data.
- Model ini berguna untuk memprediksi keberhasilan strategi pemasaran.
Referensi
| No | Penulis | Tahun | Judul | Sumber |
|---|---|---|---|---|
| 1 | Lipovetsky, S. | 2021 | Logistic and multinomial-logit models: A brief review on their modifications. | Mathematical & Computer Modelling of Dynamical Systems. |
| 2 | Liang, J., Bi, G., & Zhan, C. | 2019 | Multinomial and ordinal logistic regression analyses with multi-categorical variables using R. | Frontiers in Public Health. |
| 3 | Pate, A., et al. | 2022 | Minimum sample size for developing a multivariable prediction model using multinomial logistic regression. | arXiv preprint. |
| 4 | van Hoor de, K. et al. | 2016 | Validation and updating of risk models based on multinomial logistic regression. | Diagnostic and Prognostic Research. |
| 5 | Analysis and Predictive Modeling | - | Regression Models — Analysis and Predictive Modeling. | link |