| title: “Analisis Perbandingan Metode LDA vs OLR pada Kebugaran Fisik” |
| output: html_document: theme: flatly highlight: tango toc: true toc_float: true code_folding: show |
Laporan ini menyajikan analisis komparatif klasifikasi pada Body Performance Dataset. Fokus utama penelitian ini adalah mengevaluasi performa dua algoritme statistik dalam memprediksi tingkat kebugaran seseorang. Analisis dilakukan secara bertahap:
- Data Loading & Understanding — Mengenali karakteristik dataset dan tipe variabel.
- Data Preprocessing — Membersihkan data dari nilai hilang, duplikasi, dan pencilan (outliers).
- Exploratory Data Analysis (EDA) — Visualisasi distribusi dan korelasi antar variabel fisik.
- Linear Discriminant Analysis (LDA) — Pemodelan klasifikasi berbasis perbedaan rata-rata kelompok.
- Ordinal Logistic Regression (OLR) — Pemodelan klasifikasi untuk data kategori yang memiliki tingkatan (ordinal).
Data diambil dari Body Performance Dataset yang tersedia di Kaggle. Dataset ini berisi 13.393 baris data yang mencakup pengukuran fisik dan kemampuan atletik individu.
#packages <- c("kableExtra", "moments", "corrplot", "gridExtra",
# "biotools", "car", "energy", "MASS", "MVN")
#install.packages(packages)
#lapply(packages, library, character.only = TRUE)
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.5.3
## Warning: package 'readr' was built under R version 4.5.3
## Warning: package 'forcats' was built under R version 4.5.3
## Warning: package 'lubridate' was built under R version 4.5.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.2.0
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.1 ✔ tibble 3.3.0
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.1
## ✔ purrr 1.2.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(kableExtra)
## Warning: package 'kableExtra' was built under R version 4.5.3
##
## Attaching package: 'kableExtra'
##
## The following object is masked from 'package:dplyr':
##
## group_rows
library(MASS)
## Warning: package 'MASS' was built under R version 4.5.3
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
library(MVN)
## Warning: package 'MVN' was built under R version 4.5.3
library(biotools)
## Warning: package 'biotools' was built under R version 4.5.3
## ---
## biotools version 4.3
library(car)
## Warning: package 'car' was built under R version 4.5.3
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(moments)
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.5.3
## corrplot 0.95 loaded
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 4.5.3
##
## Attaching package: 'gridExtra'
##
## The following object is masked from 'package:dplyr':
##
## combine
library(energy)
## Warning: package 'energy' was built under R version 4.5.3
# Membaca dataset
data <- read.csv("bodyPerformance.csv")
data$gender <- ifelse(data$gender == "M", 1, 0)
# Menampilkan 5 data teratas
knitr::kable(head(data, 5)) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| age | gender | height_cm | weight_kg | body.fat_. | diastolic | systolic | gripForce | sit.and.bend.forward_cm | sit.ups.counts | broad.jump_cm | class |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 27 | 1 | 172.3 | 75.24 | 21.3 | 80 | 130 | 54.9 | 18.4 | 60 | 217 | C |
| 25 | 1 | 165.0 | 55.80 | 15.7 | 77 | 126 | 36.4 | 16.3 | 53 | 229 | A |
| 31 | 1 | 179.6 | 78.00 | 20.1 | 92 | 152 | 44.8 | 12.0 | 49 | 181 | C |
| 32 | 1 | 174.5 | 71.10 | 18.4 | 76 | 147 | 41.4 | 15.2 | 53 | 219 | B |
| 28 | 1 | 173.8 | 67.70 | 17.1 | 70 | 127 | 43.5 | 27.1 | 45 | 217 | B |
str(data)
## 'data.frame': 13393 obs. of 12 variables:
## $ age : num 27 25 31 32 28 36 42 33 54 28 ...
## $ gender : num 1 1 1 1 1 0 0 1 1 1 ...
## $ height_cm : num 172 165 180 174 174 ...
## $ weight_kg : num 75.2 55.8 78 71.1 67.7 ...
## $ body.fat_. : num 21.3 15.7 20.1 18.4 17.1 22 32.2 36.9 27.6 14.4 ...
## $ diastolic : num 80 77 92 76 70 64 72 84 85 81 ...
## $ systolic : num 130 126 152 147 127 119 135 137 165 156 ...
## $ gripForce : num 54.9 36.4 44.8 41.4 43.5 23.8 22.7 45.9 40.4 57.9 ...
## $ sit.and.bend.forward_cm: num 18.4 16.3 12 15.2 27.1 21 0.8 12.3 18.6 12.1 ...
## $ sit.ups.counts : num 60 53 49 53 45 27 18 42 34 55 ...
## $ broad.jump_cm : num 217 229 181 219 217 153 146 234 148 213 ...
## $ class : chr "C" "A" "C" "B" ...
Keterangan variabel:
age | Usia: Usia peserta dalam tahun.
gender | Jenis Kelamin: Laki-laki (1) atau Perempuan (0).
gripForce | Kekuatan Genggaman: Indikator kekuatan otot tubuh bagian atas.
sit-ups counts | Jumlah Sit-up: Indikator daya tahan dan kekuatan otot perut.
broad jump_cm | Lompat Jauh: Indikator kekuatan ledak otot kaki.
class | Kelas Kebugaran: Kategori kebugaran dari A (terbaik) hingga D (terendah).
summary(data)
## age gender height_cm weight_kg
## Min. :21.00 Min. :0.0000 Min. :125.0 Min. : 26.30
## 1st Qu.:25.00 1st Qu.:0.0000 1st Qu.:162.4 1st Qu.: 58.20
## Median :32.00 Median :1.0000 Median :169.2 Median : 67.40
## Mean :36.78 Mean :0.6322 Mean :168.6 Mean : 67.45
## 3rd Qu.:48.00 3rd Qu.:1.0000 3rd Qu.:174.8 3rd Qu.: 75.30
## Max. :64.00 Max. :1.0000 Max. :193.8 Max. :138.10
## body.fat_. diastolic systolic gripForce
## Min. : 3.00 Min. : 0.0 Min. : 0.0 Min. : 0.00
## 1st Qu.:18.00 1st Qu.: 71.0 1st Qu.:120.0 1st Qu.:27.50
## Median :22.80 Median : 79.0 Median :130.0 Median :37.90
## Mean :23.24 Mean : 78.8 Mean :130.2 Mean :36.96
## 3rd Qu.:28.00 3rd Qu.: 86.0 3rd Qu.:141.0 3rd Qu.:45.20
## Max. :78.40 Max. :156.2 Max. :201.0 Max. :70.50
## sit.and.bend.forward_cm sit.ups.counts broad.jump_cm class
## Min. :-25.00 Min. : 0.00 Min. : 0.0 Length:13393
## 1st Qu.: 10.90 1st Qu.:30.00 1st Qu.:162.0 Class :character
## Median : 16.20 Median :41.00 Median :193.0 Mode :character
## Mean : 15.21 Mean :39.77 Mean :190.1
## 3rd Qu.: 20.70 3rd Qu.:50.00 3rd Qu.:221.0
## Max. :213.00 Max. :80.00 Max. :303.0
Preprocessing dilakukan untuk menjamin kualitas data sebelum masuk ke tahap pemodelan. Langkah ini mencakup pemeriksaan integritas data seperti dimensi, nilai yang hilang (missing values), dan keberadaan data duplikat.
# Cek Dimensi
cat("Dimensi data:", dim(data)[1], "x", dim(data)[2], "\n")
## Dimensi data: 13393 x 12
# Cek Missing Values
print("Nilai yang hilang per kolom: ")
## [1] "Nilai yang hilang per kolom: "
colSums(is.na(data))
## age gender height_cm
## 0 0 0
## weight_kg body.fat_. diastolic
## 0 0 0
## systolic gripForce sit.and.bend.forward_cm
## 0 0 0
## sit.ups.counts broad.jump_cm class
## 0 0 0
# Cek Duplikasi
sum(duplicated(data))
## [1] 1
# Penanganan Duplikasi
data <- data[!duplicated(data), ]
EDA dilakukan untuk memahami sebaran data secara visual. Kami memeriksa skewness (kemiringan) untuk mengetahui normalitas data serta melihat korelasi antar fitur fisik guna mengidentifikasi hubungan antar variabel prediktor.
# Identifikasi kolom numerik
numeric_cols <- data %>% dplyr::select(where(is.numeric))
skew_df <- data.frame(
feature = names(numeric_cols),
skewness = sapply(numeric_cols, skewness, na.rm = TRUE)
)
print(skew_df)
## feature skewness
## age age 0.59970739
## gender gender -0.54850509
## height_cm height_cm -0.18700835
## weight_kg weight_kg 0.34977110
## body.fat_. body.fat_. 0.36127023
## diastolic diastolic -0.15976777
## systolic systolic -0.04733837
## gripForce gripForce 0.01830948
## sit.and.bend.forward_cm sit.and.bend.forward_cm 0.78550730
## sit.ups.counts sit.ups.counts -0.46765257
## broad.jump_cm broad.jump_cm -0.42269076
numeric_features <- c("age","gender","height_cm","weight_kg","body.fat_.",
"diastolic","systolic","gripForce",
"sit.and.bend.forward_cm","sit.ups.counts",
"broad.jump_cm")
plot_list <- lapply(numeric_features, function(col) {
ggplot(data, aes(x = .data[[col]])) +
geom_histogram(aes(y = after_stat(density)), bins = 30,
fill = "skyblue", color = "white") +
geom_density(color = "red", linewidth = 1) +
theme_minimal() +
labs(title = paste("Distribusi:", col))
})
gridExtra::grid.arrange(grobs = plot_list, ncol = 2)
# Barplot distribusi kelas
h <- barplot(table(data$class),
main = "Distribusi Kelas Kebugaran",
xlab = "Kelas",
ylab = "Jumlah Orang",
col = "skyblue",
border = "white")
text(x = h, y = table(data$class), labels = table(data$class), pos = 3, cex = 0.8)
# Boxplot visualisasi persebaran kelas (ordinal mapping)
boxplot(as.numeric(factor(data$class, levels=c("D","C","B","A"))) ,
main="Boxplot Tingkat Kebugaran", col="lightblue", horizontal=TRUE)
correlation_matrix <- cor(numeric_cols)
corrplot(correlation_matrix,
method = "color",
type = "upper",
tl.col = "black",
addCoef.col = "black",
number.cex = 0.7,
mar = c(0, 0, 2, 0),
title = "Korelasi antar Fitur Fisik")
Pada tahap akhir persiapan, kami menangani pencilan menggunakan metode Interquartile Range (IQR) agar model tidak bias. Kami juga melakukan transformasi logaritma pada variabel dengan skewness tinggi untuk mendekati distribusi normal.
# Imputasi dengan Median (jika ada nilai kosong)
numeric_indices <- sapply(data, is.numeric)
for (col in names(data)[numeric_indices]) {
if(any(is.na(data[[col]]))) {
data[[col]][is.na(data[[col]])] <- median(data[[col]], na.rm = TRUE)
cat("Imputasi median dilakukan untuk:", col, "\n")
}
}
handle_outliers_iqr <- function(x, multiplier = 1.5) {
q1 <- quantile(x, 0.25, na.rm = TRUE)
q3 <- quantile(x, 0.75, na.rm = TRUE)
iqr <- q3 - q1
lower_bound <- q1 - (multiplier * iqr)
upper_bound <- q3 + (multiplier * iqr)
x[x < lower_bound] <- lower_bound
x[x > upper_bound] <- upper_bound
return(x)
}
# Terapkan pada semua kolom numerik
for (col in names(data)[numeric_indices]) {
data[[col]] <- handle_outliers_iqr(data[[col]])
}
# Plot ulang untuk verifikasi
numeric_long <- data %>%
dplyr::select(where(is.numeric)) %>%
pivot_longer(everything(), names_to = "Feature", values_to = "Value")
ggplot(numeric_long, aes(x = Feature, y = Value)) +
geom_boxplot(fill = "#69b3a2") +
coord_flip() +
theme_minimal() +
labs(title = "Boxplot Setelah Penanganan Outlier")
# Mengatur class menjadi ordered factor (D < C < B < A)
data$class <- factor(data$class, levels = c("D", "C", "B", "A"), ordered = TRUE)
for (var in skew_df$feature) {
skew_val <- skew_df$skewness[skew_df$feature == var]
if (skew_val > 0.5) {
data[[paste0("trans_", var)]] <- log1p(data[[var]])
} else {
data[[paste0("trans_", var)]] <- data[[var]]
}
}
## Warning in log1p(data[[var]]): NaNs produced
fit_vif <- lm(as.numeric(class) ~ height_cm + weight_kg +
body.fat_. + gripForce + broad.jump_cm +
trans_age + trans_sit.ups.counts, data = data)
vif(fit_vif)
## height_cm weight_kg body.fat_.
## 4.244994 4.698870 3.412971
## gripForce broad.jump_cm trans_age
## 4.307117 4.310198 1.625180
## trans_sit.ups.counts
## 2.896176
VIF (Variance Inflation Factor) digunakan untuk mendeteksi adanya korelasi yang tinggi antar variabel bebas (independent) dalam model regresi.
Uji Hipotesis:
H0 : Tidak ada multikolinearitas signifikan antar variabel (VIF <= 5 atau 10). H1 : Terdapat multikolinearitas signifikan antar variabel (VIF > 5 atau 10). Kriteria Keputusan:
Jika semua nilai VIF < 5: Tidak ada indikasi multikolinearitas serius. Jika ada nilai VIF > 5 (moderate) atau > 10 (severe): Ada indikasi multikolinearitas.
Keputusan: Nilai VIF untuk semua fitur berada < 5 yang artinya tidak ada indikasi multikolinieritas yang serius. ## Untuk Model LDA ### Uji Normalitas Multivariat
# Ambil variabel numerik (gender DIHAPUS karena kategorik)
data_numeric <- data[, c("age","height_cm","weight_kg","body.fat_.",
"diastolic","systolic","gripForce",
"sit.and.bend.forward_cm","sit.ups.counts",
"broad.jump_cm")]
# Bersihkan data
data_numeric <- data_numeric[complete.cases(data_numeric), ]
# Hapus kolom dengan varians 0
data_numeric <- data_numeric[, sapply(data_numeric, function(x) var(x, na.rm = TRUE) > 0)]
# Sampling untuk efisiensi (opsional)
set.seed(123)
if (nrow(data_numeric) > 400) {
data_numeric <- data_numeric[sample(nrow(data_numeric), 400), ]
}
# Fungsi uji Mardia
mardia_test <- function(X){
n <- nrow(X)
p <- ncol(X)
S <- cov(X)
S_inv <- solve(S)
X_center <- scale(X, center = TRUE, scale = FALSE)
D <- X_center %*% S_inv %*% t(X_center)
# Skewness
b1p <- sum(D^3) / (n^2)
skew_stat <- n * b1p / 6
df_skew <- p * (p + 1) * (p + 2) / 6
p_skew <- 1 - pchisq(skew_stat, df_skew)
# Kurtosis
di <- diag(D)
b2p <- mean(di^2)
kurt_stat <- (b2p - p*(p+2)) / sqrt(8*p*(p+2)/n)
p_kurt <- 2 * (1 - pnorm(abs(kurt_stat)))
list(
skewness_stat = skew_stat,
skewness_p = p_skew,
kurtosis_stat = kurt_stat,
kurtosis_p = p_kurt
)
}
mardia_result <- mardia_test(data_numeric)
print(mardia_result)
## $skewness_stat
## [1] 498.5918
##
## $skewness_p
## [1] 0
##
## $kurtosis_stat
## [1] 4.925405
##
## $kurtosis_p
## [1] 8.418586e-07
# QQ Plot Mahalanobis Distance
dist_mahalanobis <- mahalanobis(
data_numeric,
colMeans(data_numeric),
cov(data_numeric)
)
n <- nrow(data_numeric)
p <- ncol(data_numeric)
qqplot(
qchisq(ppoints(n), df = p),
dist_mahalanobis,
xlab = "Theoretical Quantiles",
ylab = "Mahalanobis Distance",
pch = 19, cex = 0.6
)
abline(0, 1, col = "red", lwd = 2)
Mardia digunakan untuk menguji apakah data multivariat terdistribusi
normal. Uji ini berbasis pada perhitungan kemiringan (skewness) dan
keruncingan (kurtosis) data multivariat.
Uji Hipotesis:
H0 : Data mengikuti distribusi normal multivariat. H1 : Data tidak mengikuti distribusi normal multivariat. Kriteria Keputusan:
Jika p-value > 0.05, Terima H0, Data normal. Jika p-value <= 0.05, Tolak H0, Data tidak normal multivariat.
Ketuputsan: Tolak H0, Data tidak normal multivariat.
# Uji Box's M menggunakan kolom target 'class' yang sudah di-binning
data_box <- data[, c(
"age", "height_cm", "weight_kg", "body.fat_.",
"diastolic", "systolic", "gripForce",
"sit.and.bend.forward_cm", "sit.ups.counts",
"broad.jump_cm", "class"
)]
# Hapus NA
data_box <- data_box[complete.cases(data_box), ]
# Pisahkan fitur dan grup
X <- data_box[, 1:10]
group <- as.factor(data_box$class)
# Pastikan tidak ada level kosong
group <- droplevels(group)
# Cek distribusi kelas (WAJIB cek ini)
table(group)
## group
## D C B A
## 3349 3349 3347 3347
# Jalankan Box's M
box_test <- boxM(X, group)
# Output
print(box_test)
##
## Box's M-test for Homogeneity of Covariance Matrices
##
## data: X
## Chi-Sq (approx.) = 6082.8, df = 165, p-value < 2.2e-16
Box’s M adalah uji statistik yang digunakan untuk menguji asumsi homogenitas varians-kovarians dalam analisis MANOVA (Multivariate Analysis of Variance) atau analisis multivariat lainnya.
Uji Hipotesis:
H0 : Matriks kovarian antar kelompok adalah sama (homogen). H1 : Setidaknya satu matriks kovarian antar kelompok berbeda (tidak homogen). Kriteria Keputusan:
Jika p-value > 0.05, Terima H0, Homogen. Jika p-value <= 0.05, Tolak H0, Tidak homogen.
keputusan: Uji Homogenitas Box’s M menunjukkan nilai p-value kurang dari 0,05 yang mana keputusannya terima H0, atau matriks kovarian antar kelompok Tidak Homogen
df_model <- data[, c(
"class",
"trans_age",
"trans_height_cm",
"trans_weight_kg",
"trans_body.fat_.",
"trans_diastolic",
"trans_systolic",
"trans_gripForce",
"trans_sit.and.bend.forward_cm",
"trans_sit.ups.counts",
"trans_broad.jump_cm",
"gender"
)]
# CLEANING
df_model[df_model == Inf] <- NA
df_model[df_model == -Inf] <- NA
df_model <- na.omit(df_model)
df_model <- df_model[is.finite(rowSums(df_model[, sapply(df_model, is.numeric)])), ]
# SPLIT DATA
set.seed(123)
index <- sample(1:nrow(df_model), 0.7 * nrow(df_model))
train <- df_model[index, ]
test <- df_model[-index, ]
model_lda <- lda(class ~ ., data = train)
model_olr <- polr(class ~ ., data = train, Hess = TRUE)
pred_lda <- predict(model_lda, test)$class
pred_olr <- predict(model_olr, test)