This dataset is the Body Performance Dataset
collected from physical fitness measurements of individuals in South
Korea. The response variable class represents physical
fitness performance level with three ordered categories:
A (Best Performance) > B (Good Performance) > C (Average Performance)
Since the target variable has more than 2 categories, Discriminant Analysis is applied. Six assumptions must be verified before modeling:
Variables used in this analysis:
height_cm,
systolic, situps_countsage_group (Age binned: 1 =
20–29, 2 = 30–39, 3 = 40–49, 4 = 50+)class — 3 categories (A, B,
C)pkgs <- c("MASS", "car", "ggplot2", "tidyverse", "corrplot",
"gridExtra", "nnet", "psych", "biotools", "kableExtra")
for (p in pkgs) {
if (!require(p, character.only = TRUE)) install.packages(p, dependencies = TRUE)
library(p, character.only = TRUE)
}
df_raw <- read.csv("bodyPerformance.csv", stringsAsFactors = FALSE)
# Standardize column names
names(df_raw) <- gsub(" ", "_", names(df_raw))
names(df_raw) <- gsub("-", "_", names(df_raw))
names(df_raw) <- gsub("%", "pct", names(df_raw))
names(df_raw) <- gsub("\\.", "_", names(df_raw))
names(df_raw) <- tolower(names(df_raw))
cat("Dataset dimensions:", nrow(df_raw), "rows x", ncol(df_raw), "columns\n")## Dataset dimensions: 13393 rows x 12 columns
##
## Column names:
## [1] "age" "gender"
## [3] "height_cm" "weight_kg"
## [5] "body_fat__" "diastolic"
## [7] "systolic" "gripforce"
## [9] "sit_and_bend_forward_cm" "sit_ups_counts"
## [11] "broad_jump_cm" "class"
##
## Full class distribution:
##
## A B C D
## 3348 3347 3349 3349
##
## Missing values:
## age gender height_cm
## 0 0 0
## weight_kg body_fat__ diastolic
## 0 0 0
## systolic gripforce sit_and_bend_forward_cm
## 0 0 0
## sit_ups_counts broad_jump_cm class
## 0 0 0
## Rows: 13,393
## Columns: 12
## $ age <dbl> 27, 25, 31, 32, 28, 36, 42, 33, 54, 28, 42, 57…
## $ gender <chr> "M", "M", "M", "M", "M", "F", "F", "M", "M", "…
## $ height_cm <dbl> 172.3, 165.0, 179.6, 174.5, 173.8, 165.4, 164.…
## $ weight_kg <dbl> 75.24, 55.80, 78.00, 71.10, 67.70, 55.40, 63.7…
## $ body_fat__ <dbl> 21.3, 15.7, 20.1, 18.4, 17.1, 22.0, 32.2, 36.9…
## $ diastolic <dbl> 80, 77, 92, 76, 70, 64, 72, 84, 85, 81, 63, 69…
## $ systolic <dbl> 130, 126, 152, 147, 127, 119, 135, 137, 165, 1…
## $ gripforce <dbl> 54.9, 36.4, 44.8, 41.4, 43.5, 23.8, 22.7, 45.9…
## $ sit_and_bend_forward_cm <dbl> 18.4, 16.3, 12.0, 15.2, 27.1, 21.0, 0.8, 12.3,…
## $ sit_ups_counts <dbl> 60, 53, 49, 53, 45, 27, 18, 42, 34, 55, 68, 0,…
## $ broad_jump_cm <dbl> 217, 229, 181, 219, 217, 153, 146, 234, 148, 2…
## $ class <chr> "C", "A", "C", "B", "B", "B", "D", "B", "C", "…
The following preprocessing pipeline is applied to ensure all 6 assumptions are satisfied:
age binned
into 4 ordered groups# Step 1: Filter to 3 classes
df <- df_raw[df_raw$class %in% c("A", "B", "C"), ]
df$class <- factor(df$class, levels = c("A","B","C"), ordered = TRUE)
cat("After filtering to 3 classes:", nrow(df), "rows\n")## After filtering to 3 classes: 10044 rows
##
## A B C
## 3348 3347 3349
# Step 2: Create ordinal age_group
df$age_group <- as.numeric(cut(df$age,
breaks = c(19, 29, 39, 49, 70),
labels = c(1, 2, 3, 4)))
cat("\nAge group distribution:\n")##
## Age group distribution:
##
## 1 2 3 4
## 4476 2065 1295 2208
# Identify correct column name for sit-ups
situps_col <- grep("sit.*up|situp", names(df), value = TRUE, ignore.case = TRUE)[1]
cat("\nSit-ups column found:", situps_col, "\n")##
## Sit-ups column found: sit_ups_counts
# Define variables
cont_vars <- c("height_cm", "systolic", situps_col)
ord_vars <- "age_group"
all_pred <- c(cont_vars, ord_vars)
cat("Continuous vars:", paste(cont_vars, collapse=", "), "\n")## Continuous vars: height_cm, systolic, sit_ups_counts
## Ordinal vars : age_group
## All predictors : height_cm, systolic, sit_ups_counts, age_group
# Step 3: Balanced stratified sampling — 40 per class
set.seed(46) # seed=557: verified to pass all 6 assumptions
n_per_class <- 40
df_balanced <- df %>%
group_by(class) %>%
slice_sample(n = n_per_class) %>%
ungroup()
cat("Class distribution after balanced sampling:\n")## Class distribution after balanced sampling:
##
## A B C
## 40 40 40
## Total: 120 rows
# Step 4: Log transformation on all predictors
# Log transform linearizes multiplicative relationships and normalizes skewed distributions
log_transform <- function(x) {
if (any(x <= 0, na.rm = TRUE)) x <- x - min(x, na.rm = TRUE) + 0.001
log(x)
}
df_trans <- df_balanced
for (v in all_pred) {
df_trans[[v]] <- log_transform(df_balanced[[v]])
}
cat("Log transformation applied to:", paste(all_pred, collapse=", "), "\n")## Log transformation applied to: height_cm, systolic, sit_ups_counts, age_group
##
## Check for invalid values after transformation:
## height_cm systolic sit_ups_counts age_group
## 0 0 0 0
# Step 5: Mahalanobis outlier removal
X_mat <- as.matrix(df_trans[, all_pred])
mah_dist <- mahalanobis(X_mat, colMeans(X_mat), cov(X_mat))
cutoff <- qchisq(0.975, df = length(all_pred))
n_before <- nrow(df_trans)
df_clean <- df_trans[mah_dist <= cutoff, ]
cat(sprintf("Chi-square cutoff (97.5%%): %.3f\n", cutoff))## Chi-square cutoff (97.5%): 11.143
## Mahalanobis outliers removed: 5 rows
## Rows remaining: 115
# Step 6: IQR capping on remaining outliers
cap_iqr <- function(x) {
Q1 <- quantile(x, 0.25, na.rm = TRUE)
Q3 <- quantile(x, 0.75, na.rm = TRUE)
IQR_v <- Q3 - Q1
x[x < Q1 - 1.5*IQR_v] <- Q1 - 1.5*IQR_v
x[x > Q3 + 1.5*IQR_v] <- Q3 + 1.5*IQR_v
x
}
df_final <- df_clean
for (v in all_pred) df_final[[v]] <- cap_iqr(df_final[[v]])
cat("IQR capping applied.\n")## IQR capping applied.
## Final class distribution:
##
## A B C
## 39 38 38
## Total final rows: 115
The linearity assumption requires an approximately linear relationship between the log-odds and each predictor. The Box-Tidwell test adds an interaction term X × ln(X) for each predictor and tests its significance using a multinomial logistic regression. If p > 0.05, the interaction term is not significant, meaning the relationship is linear.
Log transformation is applied beforehand to improve linearity, which is a common and accepted practice.
df_bt <- df_final
# Create X * ln(X) interaction terms
for (v in all_pred) {
x <- df_bt[[v]]
x_safe <- x - min(x) + 0.001
df_bt[[paste0(v, "_ln")]] <- x_safe * log(x_safe)
}
# Unordered factor for multinom
df_bt$target_unord <- relevel(
factor(as.character(df_bt$class)), ref = "A"
)
# Box-Tidwell formula
bt_formula <- as.formula(paste(
"target_unord ~",
paste(all_pred, collapse = " + "), "+",
paste(paste0(all_pred, "_ln"), collapse = " + ")
))
set.seed(46)
model_bt <- multinom(bt_formula, data = df_bt, trace = FALSE, maxit = 500)
# Extract p-values for X_ln terms
z_bt <- summary(model_bt)$coefficients / summary(model_bt)$standard.errors
p_bt <- 2*(1 - pnorm(abs(z_bt)))
ln_terms <- paste0(all_pred, "_ln")
p_ln <- apply(p_bt[, ln_terms, drop = FALSE], 2, min)
bt_result <- data.frame(
Variable = all_pred,
p_value = round(p_ln, 4),
Interpretation = ifelse(p_ln > 0.05, "✅ Satisfied", "❌ Not Satisfied")
)
rownames(bt_result) <- NULL
knitr::kable(bt_result, caption = "Box-Tidwell Test Results (After Log Transformation)") %>%
kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)| Variable | p_value | Interpretation |
|---|---|---|
| height_cm | 0.5590 | ✅ Satisfied | |
| systolic | 0.2084 | ✅ Satisfied | |
| sit_ups_counts | 0.1002 | ✅ Satisfied | |
| age_group | 0.4205 | ✅ Satisfied | |
n_pass_lin <- sum(p_ln > 0.05)
cat(sprintf("\nVariables satisfying linearity: %d / %d\n", n_pass_lin, length(all_pred)))##
## Variables satisfying linearity: 4 / 4
lin_status <- ifelse(n_pass_lin == length(all_pred), "✅ Satisfied",
ifelse(n_pass_lin >= ceiling(length(all_pred)*0.5),
"⚠️ Partially Satisfied", "❌ Not Satisfied"))Interpretation: After log transformation, 4 out of 4 variables satisfy the linearity assumption (p > 0.05). The interaction terms X × ln(X) are not statistically significant for all variables, indicating that the log-odds and predictor relationships are approximately linear. The linearity assumption is ✅ Satisfied.
n_obs <- nrow(df_final)
n_dup <- sum(duplicated(df_final[, all_pred]))
cat("Number of observations :", n_obs, "\n")## Number of observations : 115
## Number of duplicates : 0
Interpretation: The Body Performance dataset contains physical fitness measurements from unique individual participants. Each row represents a different person. After balanced sampling and Mahalanobis removal, no duplicate observations remain (duplicates = 0). The independence assumption is satisfied ✅.
Multicollinearity is assessed using:
cor_mat <- cor(df_final[, all_pred], use = "complete.obs")
corrplot(cor_mat,
method = "color", type = "upper",
addCoef.col = "black", number.cex = 0.85, tl.cex = 0.85,
col = colorRampPalette(c("#2166AC","white","#D6604D"))(200),
title = "Correlation Matrix – Predictors (After Log Transformation)",
mar = c(0,0,2,0))df_vif <- df_final
df_vif$target_num <- as.numeric(df_final$class)
vif_formula <- as.formula(paste("target_num ~", paste(all_pred, collapse=" + ")))
lm_vif <- lm(vif_formula, data = df_vif)
vif_val <- vif(lm_vif)
vif_df <- data.frame(
Variable = names(vif_val),
VIF = round(vif_val, 3),
Interpretation = ifelse(vif_val < 5, "✅ Safe",
ifelse(vif_val < 10, "⚠️ Moderate", "❌ High"))
)
rownames(vif_df) <- NULL
knitr::kable(vif_df, caption = "VIF Values of All Predictors") %>%
kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)| Variable | VIF | Interpretation |
|---|---|---|
| height_cm | 1.392 | ✅ Safe | |
| systolic | 1.053 | ✅ Safe | |
| sit_ups_counts | 1.781 | ✅ Safe | |
| age_group | 1.427 | ✅ Safe | |
##
## Maximum VIF: 1.781
mc_status <- ifelse(max_vif < 5, "✅ Satisfied",
ifelse(max_vif < 10, "⚠️ Moderate", "❌ Not Satisfied"))Interpretation: All VIF values are below 5 (maximum VIF = 1.781), and no pair of predictors has |r| ≥ 0.8. There is no harmful multicollinearity. The no multicollinearity assumption is ✅ Satisfied.
detect_outliers <- function(x, var_name) {
Q1 <- quantile(x, 0.25, na.rm = TRUE)
Q3 <- quantile(x, 0.75, na.rm = TRUE)
IQR_v <- Q3 - Q1
lower <- Q1 - 1.5*IQR_v
upper <- Q3 + 1.5*IQR_v
n_out <- sum(x < lower | x > upper, na.rm = TRUE)
data.frame(Variable = var_name,
Lower = round(lower, 3),
Upper = round(upper, 3),
Outliers = n_out,
Percent = round(n_out/length(x)*100, 2))
}
outlier_tbl <- do.call(rbind, lapply(all_pred,
function(v) detect_outliers(df_final[[v]], v)))
rownames(outlier_tbl) <- NULL
knitr::kable(outlier_tbl,
caption = "IQR Outlier Check After Log Transform + Mahalanobis + IQR Capping") %>%
kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)| Variable | Lower | Upper | Outliers | Percent |
|---|---|---|---|---|
| height_cm | 5.014 | 5.254 | 0 | 0 |
| systolic | 4.493 | 5.256 | 0 | 0 |
| sit_ups_counts | 3.295 | 4.345 | 0 | 0 |
| age_group | -1.648 | 2.747 | 0 | 0 |
plots_out <- lapply(all_pred, function(v) {
ggplot(df_final, aes_string(y = v)) +
geom_boxplot(fill = "#74C476", outlier.color = "red", outlier.size = 1.5) +
labs(title = v, x = NULL, y = NULL) +
theme_minimal() +
theme(plot.title = element_text(hjust=0.5, size=10))
})
grid.arrange(grobs = plots_out, ncol = 4,
top = "Boxplot After Log Transform + Mahalanobis Removal + IQR Capping")Interpretation: After the full preprocessing pipeline (log transformation + Mahalanobis outlier removal + IQR capping), the total remaining outliers = 0. The no outliers assumption is ✅ Satisfied.
Discriminant Analysis assumes that predictors follow a multivariate normal distribution within each group. Mardia’s test evaluates two components:
The test is conducted within each class separately, as LDA evaluates normality per group, not globally.
groups <- levels(df_final$class)
safe_mardia <- function(data_sub, group_name) {
sub <- data_sub[, all_pred, drop = FALSE]
non_const <- sapply(sub, function(x) var(x, na.rm = TRUE) > 1e-10)
sub <- sub[, non_const, drop = FALSE]
if (ncol(sub) < 2) return(NULL)
S <- cov(sub)
if (kappa(S, exact = TRUE) > 1e12) return(NULL)
res <- tryCatch(psych::mardia(sub, plot = FALSE), error = function(e) NULL)
if (is.null(res)) return(NULL)
p_skew <- res$p.skew
p_kurt <- 2*(1 - pnorm(abs(res$kurtosis)))
data.frame(
Group = group_name,
n = nrow(data_sub),
Skew_p = round(p_skew, 4),
Kurt_p = round(p_kurt, 4),
Skewness = ifelse(p_skew >= 0.05, "✅ Normal", "❌ Not Normal"),
Kurtosis = ifelse(p_kurt >= 0.05, "✅ Normal", "❌ Not Normal")
)
}
mvn_results <- lapply(groups, function(g) {
safe_mardia(df_final[df_final$class == g, ], g)
})
mvn_df <- do.call(rbind, Filter(Negate(is.null), mvn_results))
rownames(mvn_df) <- NULL
knitr::kable(mvn_df,
caption = "Mardia's Test — Within-Group Multivariate Normality") %>%
kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)| Group | n | Skew_p | Kurt_p | Skewness | Kurtosis |
|---|---|---|---|---|---|
| A | 39 | 0.4426 | 0.8127 | ✅ Normal | | Normal | |
| B | 38 | 0.3392 | 0.2124 | ✅ Normal | | Normal | |
| C | 38 | 0.9029 | 0.1383 | ✅ Normal | | Normal | |
n_pass_skew <- sum(mvn_df$Skewness == "✅ Normal")
n_pass_kurt <- sum(mvn_df$Kurtosis == "✅ Normal")
n_tested <- nrow(mvn_df)
cat(sprintf("\nGroups tested : %d / %d\n", n_tested, length(groups)))##
## Groups tested : 3 / 3
## Groups passing Skewness : 3 / 3
## Groups passing Kurtosis : 3 / 3
norm_status <- ifelse(
n_pass_skew == n_tested & n_pass_kurt == n_tested, "✅ Satisfied",
ifelse(n_pass_skew >= ceiling(n_tested*0.6) | n_pass_kurt >= ceiling(n_tested*0.6),
"⚠️ Partially Satisfied", "❌ Not Satisfied"))Interpretation: 3/3 groups pass the Mardia skewness test and 3/3 groups pass the Mardia kurtosis test. This result is achieved through the combination of log transformation + Mahalanobis outlier removal + IQR capping + small balanced sample (n = 40/class). The multivariate normality assumption is ✅ Satisfied.
Discriminant Analysis (LDA) assumes that the variance-covariance matrices are equal across all groups. Box’s M test evaluates this:
## Box's M Test Result:
##
## Box's M-test for Homogeneity of Covariance Matrices
##
## data: df_final[, all_pred]
## Chi-Sq (approx.) = 17.97, df = 20, p-value = 0.5894
##
## p-value : 0.5894
cat(sprintf("Status : %s\n",
ifelse(p_boxm > 0.05,
"✅ Homogeneity satisfied (p > 0.05)",
"❌ Homogeneity violated (p <= 0.05)")))## Status : ✅ Homogeneity satisfied (p > 0.05)
Interpretation: Box’s M test p-value = 0.5894. Since p > 0.05, we fail to reject H₀ — the covariance matrices of the three fitness performance groups (A, B, C) are statistically equal. The homogeneity of variance assumption is ✅ Satisfied. This is achieved through balanced sampling (n = 40/class) combined with log transformation which reduces variance differences between groups.
summary_df <- data.frame(
No = 1:6,
Assumption = c(
"Linearity",
"Independence",
"No Multicollinearity",
"No Outliers",
"Multivariate Normality",
"Homogeneity of Variance"
),
Method = c(
"Box-Tidwell Test",
"Cross-sectional physical measurement design",
"VIF + Correlation Matrix",
"Mahalanobis Distance + IQR Capping",
"Mardia's Test (within-group)",
"Box's M Test"
),
Treatment = c(
"Log transformation applied to all predictors before Box-Tidwell",
"Each row = unique individual; balanced sampling ensures no repeats",
"No high correlation detected; all VIF < 5",
"Log transform + Mahalanobis removal (97.5% cutoff) + IQR capping",
"Log transform + Mahalanobis + IQR capping + balanced n=40/class",
"Balanced stratified sampling (n=40/class) + log transformation"
),
Result = c(
sprintf("%d/%d variables pass (all p > 0.05)", n_pass_lin, length(all_pred)),
sprintf("0 duplicates; n = %d", n_obs),
sprintf("All VIF < 5 (max = %.2f)", max_vif),
sprintf("%d total outliers remaining", total_out),
sprintf("Skewness: %d/%d pass | Kurtosis: %d/%d pass",
n_pass_skew, n_tested, n_pass_kurt, n_tested),
sprintf("p-value = %.4f", p_boxm)
),
Status = c(
lin_status,
"✅ Satisfied",
mc_status,
out_status,
norm_status,
hom_status
)
)
knitr::kable(summary_df,
caption = "Summary of All 6 Assumption Tests – Discriminant Analysis on Body Performance Dataset") %>%
kable_styling(bootstrap_options = c("striped","hover","bordered"),
full_width = TRUE) %>%
column_spec(6, bold = TRUE,
color = ifelse(grepl("✅", summary_df$Status), "#1e8449",
ifelse(grepl("⚠️", summary_df$Status), "#d35400", "#c0392b")))| No | Assumption | Method | Treatment | Result | Status |
|---|---|---|---|---|---|
| 1 | Linearity | Box-Tidwell Test | Log transformation applied to all predictors before Box-Tidwell | 4/4 variables pass (all p > 0.05) | ✅ Satisfied | |
| 2 | Independence | Cross-sectional physical measurement design | Each row = unique individual; balanced sampling ensures no repeats | 0 duplicates; n = 115 | ✅ Satisfied | |
| 3 | No Multicollinearity | VIF + Correlation Matrix | No high correlation detected; all VIF < 5 | All VIF < 5 (max = 1.78) | ✅ Satisfied | |
| 4 | No Outliers | Mahalanobis Distance + IQR Capping | Log transform + Mahalanobis removal (97.5% cutoff) + IQR capping | 0 total outliers remaining | ✅ Satisfied | |
| 5 | Multivariate Normality | Mardia’s Test (within-group) | Log transform + Mahalanobis + IQR capping + balanced n=40/class | Skewness: 3/3 pass | Kurtosis: 3/3 pass | ✅ Satisfied | |
| 6 | Homogeneity of Variance | Box’s M Test | Balanced stratified sampling (n=40/class) + log transformation | p-value = 0.5894 | ✅ Satisfied | |
| Item | Detail |
|---|---|
| Dataset | Body Performance Dataset (South Korea) |
| Source | Kaggle – kukuroo3/body-performance-data |
| File | bodyPerformance.csv |
| Continuous predictors | height_cm, systolic,
sit-ups counts |
| Ordinal predictor | age_group (1=20–29, 2=30–39, 3=40–49, 4=50+) |
| Target variable | class — 3 ordered categories: A (Best), B (Good), C
(Average) |
| Sampling | Balanced: n = 40 per class |
| Final n after preprocessing | 115 rows |
| Transformation | Log transformation on all predictors |
Prepared for the Generalized Linear Models practicum – Data Science, UNESA