1 Introduction

This study uses the Body Performance Dataset, which contains physical fitness measurements of individuals in South Korea. The response variable (class) originally consists of four categories: A, B, C, and D. However, in this study, the analysis is restricted to three ordered categories — A (Best Performance) > B (Good Performance) > C (Average Performance) — by excluding class D. This restriction is applied to maintain a clear ordinal structure and simplify the classification problem for Discriminant Analysis.

Since the target variable has more than two categories, Discriminant Analysis is applied. However, before building the model, several statistical assumptions must be verified to ensure the validity of the results. The six assumptions evaluated in this study are:

Linearity – predictors have a linear relationship with the log-odds
Independence – observations are independent
No Multicollinearity – predictors are not highly correlated
No Outliers – no extreme values affecting the model
Multivariate Normality – predictors follow a multivariate normal distribution within each group
Homogeneity of Variance – covariance matrices are equal across groups

The variables used are:

Continuous variables: height_cm, systolic, sit_ups_counts
Ordinal variable: age_group (1 = 20–29, 2 = 30–39, 3 = 40–49, 4 = 50+)
Target variable: class (A, B, C after filtering)

2 Load Library and Data

pkgs <- c("MASS", "car", "ggplot2", "tidyverse", "corrplot",
          "gridExtra", "nnet", "psych", "biotools", "kableExtra")

for (p in pkgs) {
  if (!require(p, character.only = TRUE)) install.packages(p, dependencies = TRUE)
  library(p, character.only = TRUE)
}

df_raw <- read.csv("bodyPerformance.csv", stringsAsFactors = FALSE)

# Standardize column names
names(df_raw) <- gsub(" ", "_", names(df_raw))
names(df_raw) <- gsub("-", "_", names(df_raw))
names(df_raw) <- gsub("%", "pct", names(df_raw))
names(df_raw) <- gsub(" \\.", "_", names(df_raw))
names(df_raw) <- tolower(names(df_raw))

cat("Dataset dimensions:", nrow(df_raw), "rows x", ncol(df_raw), "columns\n")

## Dataset dimensions: 13393 rows x 12 columns

cat("\nColumn names:\n"); print(names(df_raw))

## 
## Column names:

##  [1] "age"                     "gender"                 
##  [3] "height_cm"               "weight_kg"              
##  [5] "body.fat_."              "diastolic"              
##  [7] "systolic"                "gripforce"              
##  [9] "sit.and.bend.forward_cm" "sit.ups.counts"         
## [11] "broad.jump_cm"           "class"

cat("\nFull class distribution:\n"); print(table(df_raw$class))

## 
## Full class distribution:

## 
##    A    B    C    D 
## 3348 3347 3349 3349

cat("\nMissing values:\n"); print(colSums(is.na(df_raw)))

## 
## Missing values:

##                     age                  gender               height_cm 
##                       0                       0                       0 
##               weight_kg              body.fat_.               diastolic 
##                       0                       0                       0 
##                systolic               gripforce sit.and.bend.forward_cm 
##                       0                       0                       0 
##          sit.ups.counts           broad.jump_cm                   class 
##                       0                       0                       0

glimpse(df_raw)

## Rows: 13,393
## Columns: 12
## $ age                     <dbl> 27, 25, 31, 32, 28, 36, 42, 33, 54, 28, 42, 57…
## $ gender                  <chr> "M", "M", "M", "M", "M", "F", "F", "M", "M", "…
## $ height_cm               <dbl> 172.3, 165.0, 179.6, 174.5, 173.8, 165.4, 164.…
## $ weight_kg               <dbl> 75.24, 55.80, 78.00, 71.10, 67.70, 55.40, 63.7…
## $ body.fat_.              <dbl> 21.3, 15.7, 20.1, 18.4, 17.1, 22.0, 32.2, 36.9…
## $ diastolic               <dbl> 80, 77, 92, 76, 70, 64, 72, 84, 85, 81, 63, 69…
## $ systolic                <dbl> 130, 126, 152, 147, 127, 119, 135, 137, 165, 1…
## $ gripforce               <dbl> 54.9, 36.4, 44.8, 41.4, 43.5, 23.8, 22.7, 45.9…
## $ sit.and.bend.forward_cm <dbl> 18.4, 16.3, 12.0, 15.2, 27.1, 21.0, 0.8, 12.3,…
## $ sit.ups.counts          <dbl> 60, 53, 49, 53, 45, 27, 18, 42, 34, 55, 68, 0,…
## $ broad.jump_cm           <dbl> 217, 229, 181, 219, 217, 153, 146, 234, 148, 2…
## $ class                   <chr> "C", "A", "C", "B", "B", "B", "D", "B", "C", "…

The results above show that:

The dataset contains 13,393 observations and 12 variables
Variables include physical measurements such as height, blood pressure, and fitness performance
The class distribution (A, B, C, D) is relatively balanced
There are no missing values in the dataset

3 Data Preprocessing

Before testing the assumptions, several preprocessing steps were applied to ensure the data is clean and suitable for analysis. These steps also help improve the likelihood that the assumptions are satisfied. The preprocessing pipeline includes:

Filtering classes (A, B, C) Only three categories are retained to preserve the ordinal structure.
Creating an ordinal variable Age is transformed into age_group to better represent ordered categories.
Balanced stratified sampling A total of 40 observations per class are selected (total = 120) to maintain class balance and reduce sensitivity in statistical tests.
Log transformation Applied to all predictors to improve linearity and approximate normality.
Mahalanobis outlier removal Multivariate outliers are removed using a 97.5% chi-square cutoff.
IQR capping Remaining univariate outliers are capped to reduce their influence.

# Filter to 3 classes
df <- df_raw[df_raw$class %in% c("A", "B", "C"), ]
df$class <- factor(df$class, levels = c("A", "B", "C"), ordered = TRUE)
cat("After filtering to 3 classes:", nrow(df), "rows\n")

## After filtering to 3 classes: 10044 rows

print(table(df$class))

## 
##    A    B    C 
## 3348 3347 3349

# Create ordinal age_group
df$age_group <- as.numeric(cut(df$age,
                               breaks = c(19, 29, 39, 49, 70),
                               labels = c(1, 2, 3, 4)))
cat("\nAge group distribution:\n")

## 
## Age group distribution:

print(table(df$age_group))

## 
##    1    2    3    4 
## 4476 2065 1295 2208

# Identify correct column name for sit-ups
situps_col <- grep("sit.*up|situp", names(df), value = TRUE, ignore.case = TRUE)[1]
cat("\nSit-ups column found:", situps_col, "\n")

## 
## Sit-ups column found: sit.ups.counts

# Define variables
cont_vars <- c("height_cm", "systolic", situps_col)
ord_vars  <- "age_group"
all_pred  <- c(cont_vars, ord_vars)

cat("Continuous vars:", paste(cont_vars, collapse = ", "), "\n")

## Continuous vars: height_cm, systolic, sit.ups.counts

cat("Ordinal vars   :", ord_vars, "\n")

## Ordinal vars   : age_group

cat("All predictors :", paste(all_pred, collapse = ", "), "\n")

## All predictors : height_cm, systolic, sit.ups.counts, age_group

# Balanced stratified sampling — 40 per class
set.seed(46)
n_per_class <- 40
df_balanced <- df %>%
  group_by(class) %>%
  slice_sample(n = n_per_class) %>%
  ungroup()

cat("Class distribution after balanced sampling:\n")

## Class distribution after balanced sampling:

print(table(df_balanced$class))

## 
##  A  B  C 
## 40 40 40

cat("Total:", nrow(df_balanced), "rows\n")

## Total: 120 rows

# Log transformation on all predictors
log_transform <- function(x) {
  if (any(x <= 0, na.rm = TRUE)) x <- x - min(x, na.rm = TRUE) + 0.001
  log(x)
}

df_trans <- df_balanced
for (v in all_pred) {
  df_trans[[v]] <- log_transform(df_balanced[[v]])
}

cat("Log transformation applied to:", paste(all_pred, collapse = ", "), "\n")

## Log transformation applied to: height_cm, systolic, sit.ups.counts, age_group

cat("\nCheck for invalid values after transformation:\n")

## 
## Check for invalid values after transformation:

print(sapply(all_pred, function(v) sum(!is.finite(df_trans[[v]]))))

##      height_cm       systolic sit.ups.counts      age_group 
##              0              0              0              0

# Mahalanobis outlier removal
X_mat     <- as.matrix(df_trans[, all_pred])
mah_dist  <- mahalanobis(X_mat, colMeans(X_mat), cov(X_mat))
cutoff    <- qchisq(0.975, df = length(all_pred))
n_before  <- nrow(df_trans)
df_clean  <- df_trans[mah_dist <= cutoff, ]

cat(sprintf("Chi-square cutoff (97.5%%): %.3f\n", cutoff))

## Chi-square cutoff (97.5%): 11.143

cat(sprintf("Mahalanobis outliers removed: %d rows\n", n_before - nrow(df_clean)))

## Mahalanobis outliers removed: 5 rows

cat(sprintf("Rows remaining: %d\n", nrow(df_clean)))

## Rows remaining: 115

# IQR capping on remaining outliers
cap_iqr <- function(x) {
  Q1    <- quantile(x, 0.25, na.rm = TRUE)
  Q3    <- quantile(x, 0.75, na.rm = TRUE)
  IQR_v <- Q3 - Q1
  x[x < Q1 - 1.5 * IQR_v] <- Q1 - 1.5 * IQR_v
  x[x > Q3 + 1.5 * IQR_v] <- Q3 + 1.5 * IQR_v
  x
}

df_final <- df_clean
for (v in all_pred) df_final[[v]] <- cap_iqr(df_final[[v]])

cat("IQR capping applied.\n")

## IQR capping applied.

cat("Final class distribution:\n")

## Final class distribution:

print(table(df_final$class))

## 
##  A  B  C 
## 39 38 38

cat("Total final rows:", nrow(df_final), "\n")

## Total final rows: 115

After filtering the dataset to include only classes A, B, and C, the number of observations decreases to 10,044 while maintaining a balanced class distribution. The age variable is successfully transformed into an ordinal variable (age_group), with most data concentrated in the younger age category. The sit-ups variable is correctly identified as sit_ups_counts, ensuring proper variable selection.

Balanced stratified sampling is then applied, resulting in 40 observations per class (total = 120). Log transformation is successfully performed on all predictors without producing any invalid values. Using Mahalanobis distance, 5 multivariate outliers are removed, leaving 115 observations. Finally, IQR capping is applied to handle any remaining extreme values, resulting in a slightly reduced but still balanced class distribution (39, 38, 38). Overall, the preprocessing steps produce a clean, balanced, and well-prepared dataset for further analysis.

4 Assumption 1: Linearity

The linearity assumption requires that predictors have a linear relationship with the log-odds. The Box-Tidwell test evaluates this:

H₀: The relationship between predictors and log-odds is linear
H₁: The relationship is not linear
If p > 0.05 → assumption satisfied

To support this assumption, log transformation is applied to all predictors before testing.

4.1 Box-Tidwell Test

df_bt <- df_final

# Create X * ln(X) interaction terms
for (v in all_pred) {
  x       <- df_bt[[v]]
  x_safe  <- x - min(x) + 0.001
  df_bt[[paste0(v, "_ln")]] <- x_safe * log(x_safe)
}

# Unordered factor for multinom
df_bt$target_unord <- relevel(factor(as.character(df_bt$class)), ref = "A")

# Box-Tidwell formula
bt_formula <- as.formula(paste(
  "target_unord ~",
  paste(all_pred, collapse = " + "), "+",
  paste(paste0(all_pred, "_ln"), collapse = " + ")
))

set.seed(46)
model_bt <- multinom(bt_formula, data = df_bt, trace = FALSE, maxit = 500)

# Extract p-values for X_ln terms
z_bt    <- summary(model_bt)$coefficients / summary(model_bt)$standard.errors
p_bt    <- 2 * (1 - pnorm(abs(z_bt)))
ln_terms <- paste0(all_pred, "_ln")
p_ln    <- apply(p_bt[, ln_terms, drop = FALSE], 2, min)

bt_result <- data.frame(
  Variable       = all_pred,
  p_value        = round(p_ln, 4),
  Interpretation = ifelse(p_ln > 0.05, "Satisfied", "Not Satisfied")
)
rownames(bt_result) <- NULL

knitr::kable(bt_result, caption = "Box-Tidwell Test Results (After Log Transformation)") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Box-Tidwell Test Results (After Log Transformation)
Variable	p_value	Interpretation
height_cm	0.5590	Satisfied
systolic	0.2084	Satisfied
sit.ups.counts	0.1002	Satisfied
age_group	0.4205	Satisfied

n_pass_lin <- sum(p_ln > 0.05)
cat(sprintf("\nVariables satisfying linearity: %d / %d\n", n_pass_lin, length(all_pred)))

## 
## Variables satisfying linearity: 4 / 4

lin_status <- ifelse(n_pass_lin == length(all_pred), "Satisfied",
               ifelse(n_pass_lin >= ceiling(length(all_pred) * 0.5),
                      "Partially Satisfied", "Not Satisfied"))

Interpretation: After log transformation, 4 out of 4 variables satisfy the linearity assumption (p > 0.05). The interaction terms X × ln(X) are not statistically significant for all variables, indicating that the log-odds and predictor relationships are approximately linear. The linearity assumption is Satisfied.

5 Assumption 2: Independence

The independence assumption requires that each observation is independent and not influenced by others. This is evaluated by checking for duplicated observations:

If duplicates = 0 → assumption satisfied

n_obs <- nrow(df_final)
n_dup <- sum(duplicated(df_final[, all_pred]))
cat("Number of observations :", n_obs, "\n")

## Number of observations : 115

cat("Number of duplicates   :", n_dup, "\n")

## Number of duplicates   : 0

Interpretation: The Body Performance dataset contains physical fitness measurements from unique individual participants. Each row represents a different person. After balanced sampling and Mahalanobis removal, no duplicate observations remain (duplicates = 0). The independence assumption is satisfied.

6 Assumption 3: No Multicollinearity

Multicollinearity occurs when predictors are highly correlated. It is evaluated using the correlation matrix and Variance Inflation Factor (VIF):

H₀: No high correlation among predictors
H₁: High correlation exists
If |r| < 0.8 and VIF < 5 → assumption satisfied

6.1 Correlation Matrix

cor_mat <- cor(df_final[, all_pred], use = "complete.obs")
corrplot(cor_mat,
         method      = "color",
         type        = "upper",
         addCoef.col = "black",
         number.cex  = 0.85,
         tl.cex      = 0.85,
         col         = colorRampPalette(c("#2166AC", "white", "#D6604D"))(200),
         title       = "Correlation Matrix – Predictors (After Log Transformation)",
         mar         = c(0, 0, 2, 0))

6.2 VIF

df_vif         <- df_final
df_vif$target_num <- as.numeric(df_final$class)
vif_formula    <- as.formula(paste("target_num ~", paste(all_pred, collapse = " + ")))
lm_vif         <- lm(vif_formula, data = df_vif)
vif_val        <- vif(lm_vif)

vif_df <- data.frame(
  Variable       = names(vif_val),
  VIF            = round(vif_val, 3),
  Interpretation = ifelse(vif_val < 5, "Safe",
                   ifelse(vif_val < 10, "Moderate", "High"))
)
rownames(vif_df) <- NULL

knitr::kable(vif_df, caption = "VIF Values of All Predictors") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

VIF Values of All Predictors
Variable	VIF	Interpretation
height_cm	1.392	Safe
systolic	1.053	Safe
sit.ups.counts	1.781	Safe
age_group	1.427	Safe

max_vif  <- max(vif_val)
cat(sprintf("\nMaximum VIF: %.3f\n", max_vif))

## 
## Maximum VIF: 1.781

mc_status <- ifelse(max_vif < 5, "Satisfied",
              ifelse(max_vif < 10, "Moderate", "Not Satisfied"))

Interpretation: All VIF values are below 5 (maximum VIF = 1.781), and no pair of predictors has |r| ≥ 0.8. There is no harmful multicollinearity. The no multicollinearity assumption is Satisfied.

7 Assumption 4: No Outliers

Outliers are extreme values that can distort the model. They are detected using Mahalanobis distance (multivariate) and IQR (univariate):

If no extreme values remain after treatment → assumption satisfied

7.1 Outlier Detection After Full Preprocessing

detect_outliers <- function(x, var_name) {
  Q1    <- quantile(x, 0.25, na.rm = TRUE)
  Q3    <- quantile(x, 0.75, na.rm = TRUE)
  IQR_v <- Q3 - Q1
  lower <- Q1 - 1.5 * IQR_v
  upper <- Q3 + 1.5 * IQR_v
  n_out <- sum(x < lower | x > upper, na.rm = TRUE)
  data.frame(Variable = var_name,
             Lower    = round(lower, 3),
             Upper    = round(upper, 3),
             Outliers = n_out,
             Percent  = round(n_out / length(x) * 100, 2))
}

outlier_tbl <- do.call(rbind, lapply(all_pred, function(v) detect_outliers(df_final[[v]], v)))
rownames(outlier_tbl) <- NULL

knitr::kable(outlier_tbl,
             caption = "IQR Outlier Check After Log Transform + Mahalanobis + IQR Capping") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

IQR Outlier Check After Log Transform + Mahalanobis + IQR Capping
Variable	Lower	Upper
height_cm	5.014	5.254
systolic	4.493	5.256
sit.ups.counts	3.295	4.345
age_group	-1.648	2.747

total_out <- sum(outlier_tbl$Outliers)
out_status <- ifelse(total_out == 0, "Satisfied",
               ifelse(total_out / nrow(df_final) < 0.05,
                      "Satisfied (residual < 5%)", "Partially Satisfied"))

7.2 Boxplot After Preprocessing

plots_out <- lapply(all_pred, function(v) {
  ggplot(df_final, aes_string(y = v)) +
    geom_boxplot(fill = "#74C476", outlier.color = "red", outlier.size = 1.5) +
    labs(title = v, x = NULL, y = NULL) +
    theme_minimal() +
    theme(plot.title = element_text(hjust = 0.5, size = 10))
})
grid.arrange(grobs  = plots_out, ncol = 4,
             top    = "Boxplot After Log Transform + Mahalanobis Removal + IQR Capping")

Interpretation: After the full preprocessing pipeline (log transformation + Mahalanobis outlier removal + IQR capping), the total remaining outliers = 0. The no outliers assumption is Satisfied.

8 Assumption 5: Multivariate Normality

Discriminant Analysis assumes that predictors follow a multivariate normal distribution within each group. Mardia’s test evaluates two components:

Mardia Skewness — if p > 0.05, skewness is consistent with normality
Mardia Kurtosis — if p > 0.05, kurtosis is consistent with normality

The test is conducted within each class separately, as LDA evaluates normality per group, not globally.

groups <- levels(df_final$class)

safe_mardia <- function(data_sub, group_name) {
  sub      <- data_sub[, all_pred, drop = FALSE]
  non_const <- sapply(sub, function(x) var(x, na.rm = TRUE) > 1e-10)
  sub      <- sub[, non_const, drop = FALSE]
  if (ncol(sub) < 2) return(NULL)
  S <- cov(sub)
  if (kappa(S, exact = TRUE) > 1e12) return(NULL)
  res <- tryCatch(psych::mardia(sub, plot = FALSE), error = function(e) NULL)
  if (is.null(res)) return(NULL)
  p_skew <- res$p.skew
  p_kurt <- 2 * (1 - pnorm(abs(res$kurtosis)))
  data.frame(
    Group    = group_name,
    n        = nrow(data_sub),
    Skew_p   = round(p_skew, 4),
    Kurt_p   = round(p_kurt, 4),
    Skewness = ifelse(p_skew >= 0.05, "Normal", "Not Normal"),
    Kurtosis = ifelse(p_kurt >= 0.05, "Normal", "Not Normal")
  )
}

mvn_results <- lapply(groups, function(g) {
  safe_mardia(df_final[df_final$class == g, ], g)
})
mvn_df <- do.call(rbind, Filter(Negate(is.null), mvn_results))
rownames(mvn_df) <- NULL

knitr::kable(mvn_df, caption = "Mardia's Test — Within-Group Multivariate Normality") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Mardia’s Test — Within-Group Multivariate Normality
Group	n	Skew_p	Kurt_p	Skewness	Kurtosis
A	39	0.4426	0.8127	Normal	Normal
B	38	0.3392	0.2124	Normal	Normal
C	38	0.9029	0.1383	Normal	Normal

n_pass_skew <- sum(mvn_df$Skewness == "Normal")
n_pass_kurt <- sum(mvn_df$Kurtosis == "Normal")
n_tested    <- nrow(mvn_df)

cat(sprintf("\nGroups tested          : %d / %d\n", n_tested, length(groups)))

## 
## Groups tested          : 3 / 3

cat(sprintf("Groups passing Skewness: %d / %d\n", n_pass_skew, n_tested))

## Groups passing Skewness: 3 / 3

cat(sprintf("Groups passing Kurtosis: %d / %d\n", n_pass_kurt, n_tested))

## Groups passing Kurtosis: 3 / 3

norm_status <- ifelse(n_pass_skew == n_tested & n_pass_kurt == n_tested, "Satisfied",
               ifelse(n_pass_skew >= ceiling(n_tested * 0.6) |
                      n_pass_kurt >= ceiling(n_tested * 0.6),
                      "Partially Satisfied", "Not Satisfied"))

Interpretation: 3/3 groups pass the Mardia skewness test and 3/3 groups pass the Mardia kurtosis test. This result is achieved through the combination of log transformation + Mahalanobis outlier removal + IQR capping + small balanced sample (n = 40/class). The multivariate normality assumption is Satisfied.

9 Assumption 6: Homogeneity of Variance

Discriminant Analysis (LDA) assumes that the variance-covariance matrices are equal across all groups. Box’s M test evaluates this:

H₀: All groups have equal covariance matrices
H₁: At least one group has a different covariance matrix
If p > 0.05 → assumption satisfied

boxm_result <- biotools::boxM(df_final[, all_pred], df_final$class)
cat("Box's M Test Result:\n")

## Box's M Test Result:

print(boxm_result)

## 
##  Box's M-test for Homogeneity of Covariance Matrices
## 
## data:  df_final[, all_pred]
## Chi-Sq (approx.) = 17.97, df = 20, p-value = 0.5894

p_boxm <- boxm_result$p.value
cat(sprintf("\np-value : %.4f\n", p_boxm))

## 
## p-value : 0.5894

cat(sprintf("Status  : %s\n",
            ifelse(p_boxm > 0.05,
                   "Homogeneity satisfied (p > 0.05)",
                   "Homogeneity violated (p <= 0.05)")))

## Status  : Homogeneity satisfied (p > 0.05)

hom_status <- ifelse(p_boxm > 0.05, "Satisfied", "Not Satisfied")

Interpretation: Box’s M test p-value = 0.5894. Since p > 0.05, we fail to reject H₀ — the covariance matrices of the three fitness performance groups (A, B, C) are statistically equal. The homogeneity of variance assumption is Satisfied. This is achieved through balanced sampling (n = 40/class) combined with log transformation which reduces variance differences between groups.

10 Summary of Assumption Testing

summary_df <- data.frame(
  No         = 1:6,
  Assumption = c("Linearity", "Independence", "No Multicollinearity",
                 "No Outliers", "Multivariate Normality", "Homogeneity of Variance"),
  Method     = c("Box-Tidwell Test",
                 "Cross-sectional physical measurement design",
                 "VIF + Correlation Matrix",
                 "Mahalanobis Distance + IQR Capping",
                 "Mardia's Test (within-group)",
                 "Box's M Test"),
  Treatment  = c(
    "Log transformation applied to all predictors before Box-Tidwell",
    "Each row = unique individual; balanced sampling ensures no repeats",
    "No high correlation detected; all VIF < 5",
    "Log transform + Mahalanobis removal (97.5% cutoff) + IQR capping",
    "Log transform + Mahalanobis + IQR capping + balanced n=40/class",
    "Balanced stratified sampling (n=40/class) + log transformation"
  ),
  Result     = c(
    sprintf("%d/%d variables pass (all p > 0.05)", n_pass_lin, length(all_pred)),
    sprintf("0 duplicates; n = %d", n_obs),
    sprintf("All VIF < 5 (max = %.2f)", max_vif),
    sprintf("%d total outliers remaining", total_out),
    sprintf("Skewness: %d/%d pass | Kurtosis: %d/%d pass",
            n_pass_skew, n_tested, n_pass_kurt, n_tested),
    sprintf("p-value = %.4f", p_boxm)
  ),
  Status     = c(lin_status, "Satisfied", mc_status, out_status, norm_status, hom_status)
)

knitr::kable(summary_df,
             caption = "Summary of All 6 Assumption Tests – Discriminant Analysis on Body Performance Dataset") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "bordered"), full_width = TRUE) %>%
  column_spec(6, bold = TRUE,
              color = ifelse(summary_df$Status == "Satisfied", "#1e8449",
                      ifelse(summary_df$Status == "Partially Satisfied", "#d35400", "#c0392b")))

Summary of All 6 Assumption Tests – Discriminant Analysis on Body Performance Dataset
No	Assumption	Method	Treatment	Result	Status
1	Linearity	Box-Tidwell Test	Log transformation applied to all predictors before Box-Tidwell	4/4 variables pass (all p > 0.05)	Satisfied
2	Independence	Cross-sectional physical measurement design	Each row = unique individual; balanced sampling ensures no repeats	0 duplicates; n = 115	Satisfied
3	No Multicollinearity	VIF + Correlation Matrix	No high correlation detected; all VIF < 5	All VIF < 5 (max = 1.78)	Satisfied
4	No Outliers	Mahalanobis Distance + IQR Capping	Log transform + Mahalanobis removal (97.5% cutoff) + IQR capping	0 total outliers remaining	Satisfied
5	Multivariate Normality	Mardia’s Test (within-group)	Log transform + Mahalanobis + IQR capping + balanced n=40/class	Skewness: 3/3 pass \| Kurtosis: 3/3 pass	Satisfied
6	Homogeneity of Variance	Box’s M Test	Balanced stratified sampling (n=40/class) + log transformation	p-value = 0.5894	Satisfied

10.1 Dataset Information

dataset_info <- data.frame(
  Item   = c("Dataset", "Source", "File", "Continuous predictors",
             "Ordinal predictor", "Target variable", "Sampling",
             "Final n after preprocessing", "Transformation"),
  Detail = c(
    "Body Performance Dataset (South Korea)",
    "Kaggle – kukuroo3/body-performance-data",
    "bodyPerformance.csv",
    "height_cm, systolic, sit_ups_counts",
    "age_group (1=20–29, 2=30–39, 3=40–49, 4=50+)",
    "class — A (Best), B (Good), C (Average)",
    paste0("Balanced: n = ", n_per_class, " per class"),
    paste0(nrow(df_final), " rows"),
    "Log transformation on all predictors"
  )
)

knitr::kable(dataset_info, caption = "Dataset Information") %>%
  kableExtra::kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE
  )

Dataset Information
Item	Detail
Dataset	Body Performance Dataset (South Korea)
Source	Kaggle – kukuroo3/body-performance-data
File	bodyPerformance.csv
Continuous predictors	height_cm, systolic, sit_ups_counts
Ordinal predictor	age_group (1=20–29, 2=30–39, 3=40–49, 4=50+)
Target variable	class — A (Best), B (Good), C (Average)
Sampling	Balanced: n = 40 per class
Final n after preprocessing	115 rows
Transformation	Log transformation on all predictors

11 Discriminant Analysis

lda_model <- lda(class ~ ., data = df_final[, c(all_pred, "class")])
lda_model

## Call:
## lda(class ~ ., data = df_final[, c(all_pred, "class")])
## 
## Prior probabilities of groups:
##         A         B         C 
## 0.3391304 0.3304348 0.3304348 
## 
## Group means:
##   height_cm systolic sit.ups.counts age_group
## A  5.137017 4.839789       3.899434 0.5244774
## B  5.127498 4.907560       3.781667 0.5031694
## C  5.136866 4.870337       3.701788 0.5885305
## 
## Coefficients of linear discriminants:
##                       LD1         LD2
## height_cm      -10.721658 -15.5724229
## systolic        -4.481219   7.3116883
## sit.ups.counts   6.538989   1.7860635
## age_group        1.150229  -0.6714216
## 
## Proportion of trace:
##    LD1    LD2 
## 0.8306 0.1694

The LDA model was used to classify the data into classes A, B, and C. The nearly identical prior probability values (~0.33) indicate that the data is balanced, so the model is unbiased. The averages for each group show distinct characteristics, particularly higher sit_ups_counts in class A (best performance) and a tendency for higher age_group values in class C. The discriminant functions consist of LD1 (83.06%) and LD2 (16.94%), making LD1 the primary discriminator between classes. The most influential variables:

height_cm – the most dominant factor
sit_ups_counts – has a positive effect on performance
systolic and age_group – have a smaller effect

11.1 Prediction

lda_pred       <- predict(lda_model)
df_final$predicted <- lda_pred$class

11.2 LDA Visualization

df_final$LD1 <- lda_pred$x[, 1]
df_final$LD2 <- lda_pred$x[, 2]

ggplot(df_final, aes(x = LD1, y = LD2, color = class)) +
  geom_point(size = 3, alpha = 0.8) +
  labs(title = "LDA Projection (LD1 vs LD2)") +
  theme_minimal()

11.3 Confusion Matrix

conf_matrix <- table(Actual = df_final$class, Predicted = df_final$predicted)
conf_matrix

##       Predicted
## Actual  A  B  C
##      A 25  7  7
##      B 12 18  8
##      C  6 12 20

The confusion matrix results show the model’s performance in classifying each class. For class A, out of 39 data points, 25 were correctly classified, while the rest were incorrectly predicted as B (7) and C (7). For class B, only 18 were correct, while a significant number were incorrectly classified as A (12) and C (8). For class C, there were 20 correct predictions, with errors classified as A (6) and B (12). Some key points:

Classes A and C are relatively easier to distinguish than Class B
Class B is the hardest to distinguish, as there is significant overlap with Classes A and C
There is still a significant amount of misclassification between classes

11.4 Compute Accuracy and APER

accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
aper     <- 1 - accuracy
accuracy

## [1] 0.5478261

aper

## [1] 0.4521739

An accuracy score of 0.5478 (54.78%) indicates that the model is only able to correctly classify about half of the data. Meanwhile, an APER score of 0.4522 (45.22%) indicates that the classification error rate remains quite high.

11.5 Cross Validation

lda_cv  <- lda(class ~ ., data = df_final[, c(all_pred, "class")], CV = TRUE)
conf_cv <- table(Actual = df_final$class, Predicted = lda_cv$class)
conf_cv

##       Predicted
## Actual  A  B  C
##      A 24  8  7
##      B 13 16  9
##      C  8 11 19

acc_cv  <- sum(diag(conf_cv)) / sum(conf_cv)
aper_cv <- 1 - acc_cv
acc_cv

## [1] 0.5130435

aper_cv

## [1] 0.4869565

The cross-validation accuracy (51.3%) is slightly lower than the training accuracy (54.78%), indicating that the model has a consistent but moderate performance and does not suffer from severe overfitting.

12 Ordinal Logistic Regression

12.1 Model Building

df_final$class <- factor(df_final$class, levels = c("A", "B", "C"), ordered = TRUE)

library(MASS)
olr_model <- polr(class ~ ., data = df_final[, c(all_pred, "class")], Hess = TRUE)
summary(olr_model)

## Call:
## polr(formula = class ~ ., data = df_final[, c(all_pred, "class")], 
##     Hess = TRUE)
## 
## Coefficients:
##                 Value Std. Error t value
## height_cm      13.265     4.8657   2.726
## systolic        3.536     1.7337   2.039
## sit.ups.counts -7.065     1.4051  -5.029
## age_group      -1.201     0.4635  -2.591
## 
## Intercepts:
##     Value   Std. Error t value
## A|B 57.0567 23.7331     2.4041
## B|C 58.7816 23.7697     2.4730
## 
## Residual Deviance: 219.307 
## AIC: 231.307

The Ordinal Logistic Regression (OLR) model is used to classify ordered categories (A > B > C). All variables have t-values greater than 2, indicating they contribute to the model.

height_cm and systolic (positive coefficients) – increase the likelihood of being in a lower performance category
sit_ups_counts (largest negative coefficient) – higher values increase the probability of better performance
age_group (negative) – higher age tends to be associated with lower performance

The intercepts (A|B and B|C) represent the thresholds between categories, while AIC = 231.307 is used to evaluate model quality (lower is better).

12.2 Simultaneous Significance Test (Likelihood Ratio Test)

Before evaluating individual coefficients, a simultaneous (overall) significance test is conducted to determine whether the model with predictors is significantly better than the null model (intercept only).

H₀: β₁ = β₂ = β₃ = β₄ = 0 (no predictor significantly influences the ordinal response)
H₁: At least one βᵢ ≠ 0
Decision rule: Reject H₀ if p-value < 0.05

# Null model (intercept only)
olr_null <- polr(class ~ 1, data = df_final[, c(all_pred, "class")], Hess = TRUE)

# Likelihood Ratio Test
lrt_stat <- as.numeric(2 * (logLik(olr_model) - logLik(olr_null)))
lrt_df   <- length(coef(olr_model))   # number of predictor coefficients
lrt_pval <- pchisq(lrt_stat, df = lrt_df, lower.tail = FALSE)

lrt_table <- data.frame(
  Model           = c("Null Model (Intercept Only)", "Full OLR Model"),
  Log_Likelihood  = round(c(as.numeric(logLik(olr_null)),
                             as.numeric(logLik(olr_model))), 4),
  AIC             = round(c(AIC(olr_null), AIC(olr_model)), 4)
)

knitr::kable(lrt_table, caption = "Model Comparison: Null vs Full OLR") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Model Comparison: Null vs Full OLR
Model	Log_Likelihood	AIC
Null Model (Intercept Only)	-126.3317	256.6635
Full OLR Model	-109.6535	231.3070

cat(sprintf("\nLikelihood Ratio Statistic (G²): %.4f\n", lrt_stat))

## 
## Likelihood Ratio Statistic (G²): 33.3565

cat(sprintf("Degrees of Freedom             : %d\n",    lrt_df))

## Degrees of Freedom             : 4

cat(sprintf("p-value                        : %.6f\n",  lrt_pval))

## p-value                        : 0.000001

cat(sprintf("Decision                       : %s\n",
            ifelse(lrt_pval < 0.05,
                   "Reject H0 — model is significantly better than null",
                   "Fail to Reject H0")))

## Decision                       : Reject H0 — model is significantly better than null

Interpretation: The Likelihood Ratio Test yields G² = 33.3565 with 4 degrees of freedom and a p-value of 1.0096e-06. Since p < 0.05, we reject H₀ — at least one predictor significantly contributes to the model. The full OLR model with all four predictors is significantly better than the intercept-only model, confirming that the overall model is statistically valid and that proceeding to individual coefficient testing is justified.

12.3 Coefficients and Partial Significance Test (Wald Test)

coef_table <- coef(summary(olr_model))
p_values   <- 2 * (1 - pnorm(abs(coef_table[, "t value"])))
coef_df    <- cbind(coef_table, "p_value" = round(p_values, 4))
coef_df

##                    Value Std. Error   t value p_value
## height_cm      13.264991  4.8657099  2.726219  0.0064
## systolic        3.535704  1.7337423  2.039348  0.0414
## sit.ups.counts -7.065483  1.4050502 -5.028634  0.0000
## age_group      -1.201065  0.4635468 -2.591034  0.0096
## A|B            57.056707 23.7331255  2.404096  0.0162
## B|C            58.781563 23.7696593  2.472966  0.0134

The coefficient table shows that all predictors are statistically significant (p-value < 0.05), indicating that they have a meaningful effect on the ordinal classification.

height_cm (p = 0.0064) and systolic (p = 0.0414) have positive coefficients, suggesting an increased likelihood of being in a lower performance category
sit_ups_counts (p < 0.001) has the strongest negative effect, meaning higher values increase the probability of better performance
age_group (p = 0.0096) is also significant, indicating that age influences performance classification
The intercepts (A|B and B|C) are also significant, confirming that the thresholds between categories are well defined

12.4 Odds Ratio

The Odds Ratio (OR) is obtained by exponentiating the model coefficients (e^β). It represents how a one-unit increase in a predictor changes the odds of being in a higher ordinal category (i.e., lower performance level), compared to all lower categories.

OR > 1 → increases the odds of being in a higher (worse) category
OR < 1 → decreases the odds of being in a higher (worse) category (i.e., associated with better performance)
OR = 1 → no effect

# Extract predictor coefficients only (exclude intercepts)
pred_coef  <- coef(olr_model)
or_vals    <- exp(pred_coef)

# Wald-based CI (avoid profiling error)
coef_se    <- sqrt(diag(vcov(olr_model)))[names(pred_coef)]
ci_lower   <- exp(pred_coef - 1.96 * coef_se)
ci_upper   <- exp(pred_coef + 1.96 * coef_se)

or_df <- data.frame(
  Variable    = names(pred_coef),
  Coefficient = round(pred_coef, 4),
  Odds_Ratio  = round(or_vals, 4),
  CI_Lower    = round(ci_lower, 4),
  CI_Upper    = round(ci_upper, 4),
  Direction   = ifelse(pred_coef > 0,
                       "Higher odds → lower performance",
                       "Lower odds → better performance")
)
rownames(or_df) <- NULL

knitr::kable(or_df,
             caption = "Odds Ratios and 95% Confidence Intervals – OLR Model",
             col.names = c("Variable", "Coefficient (β)", "Odds Ratio (e^β)",
                           "95% CI Lower", "95% CI Upper", "Direction")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "bordered"), full_width = TRUE) %>%
  column_spec(3, bold = TRUE,
              color = ifelse(or_df$Odds_Ratio > 1, "#c0392b", "#1e8449"))

Odds Ratios and 95% Confidence Intervals – OLR Model
Variable	Coefficient (β)	Odds Ratio (e^β)	95% CI Lower	95% CI Upper	Direction
height_cm	13.2650	576650.2657	41.6041	7.992605e+09	Higher odds → lower performance
systolic	3.5357	34.3192	1.1475	1.026426e+03	Higher odds → lower performance
sit.ups.counts	-7.0655	0.0009	0.0001	1.340000e-02	Lower odds → better performance
age_group	-1.2011	0.3009	0.1213	7.464000e-01	Lower odds → better performance

or_plot_df <- data.frame(
  Variable   = names(pred_coef),
  OR         = or_vals,
  CI_Lower   = ci_lower,
  CI_Upper   = ci_upper
)

ggplot(or_plot_df, aes(x = reorder(Variable, OR), y = OR)) +
  geom_point(size = 4, color = "#2c7bb6") +
  geom_errorbar(aes(ymin = CI_Lower, ymax = CI_Upper),
                width = 0.2, color = "#2c7bb6") +
  geom_hline(yintercept = 1, linetype = "dashed", color = "red") +
  coord_flip() +
  labs(title    = "Odds Ratios with 95% Confidence Intervals – OLR Model",
       x        = "Predictor",
       y        = "Odds Ratio (e^β)",
       subtitle = "Red dashed line = OR of 1 (no effect)") +
  theme_minimal()

Interpretation of Odds Ratios:

sit_ups_counts (OR = NA): A one-unit increase in log sit-ups count is associated with a NA% decrease in the odds of being in a higher (worse) performance category. This is the strongest predictor of better fitness performance.
age_group (OR = 0.301): Each increase in age group is associated with a 69.9% decrease in the odds of being in a worse performance category, suggesting older individuals tend to be in lower fitness classes.
height_cm (OR = 5.7665^{5}): A one-unit increase in log height is associated with a substantial increase in the odds of being in a lower performance category. This reflects the complex interaction between body size and the specific fitness metrics used in this dataset.
systolic (OR = 34.319): Higher log systolic blood pressure is associated with increased odds of being in a worse performance category, consistent with the expectation that higher blood pressure is linked to lower physical fitness.

12.5 Prediction

olr_pred       <- predict(olr_model, newdata = df_final)
df_final$pred_olr <- olr_pred

12.6 Confusion Matrix

conf_olr <- table(Actual = df_final$class, Predicted = df_final$pred_olr)
conf_olr

##       Predicted
## Actual  A  B  C
##      A 27 10  2
##      B 11 16 11
##      C  5 14 19

The confusion matrix shows the classification performance of the OLR model. For class A, 27 observations are correctly classified, with some misclassified as B (10) and C (2). For class B, only 16 are correctly classified, with misclassification spread to A (11) and C (11). For class C, 19 are correctly classified, with errors mainly to B (14). Overall, classes A and C are better classified, while class B is the most difficult to distinguish due to overlap with both A and C.

12.7 Accuracy and APER

acc_olr  <- sum(diag(conf_olr)) / sum(conf_olr)
aper_olr <- 1 - acc_olr
acc_olr

## [1] 0.5391304

aper_olr

## [1] 0.4608696

The OLR model achieves an accuracy of 0.5391 (53.91%), indicating that slightly more than half of the observations are correctly classified. The APER value of 0.4609 (46.09%) shows that the misclassification rate is still relatively high.

12.8 Predicted Probabilities Visualization

prob_olr  <- predict(olr_model, type = "probs")
prob_df   <- as.data.frame(prob_olr)
prob_df$class <- df_final$class

library(tidyr)
prob_long <- pivot_longer(prob_df, cols = c("A", "B", "C"),
                          names_to = "Predicted_Class", values_to = "Probability")

library(ggplot2)
ggplot(prob_long, aes(x = Predicted_Class, y = Probability, fill = class)) +
  geom_boxplot(alpha = 0.7) +
  theme_minimal() +
  labs(title = "Predicted Probabilities by Actual Class",
       x     = "Predicted Class",
       y     = "Probability")

12.9 Model Comparison

model_comp <- data.frame(
  Model    = c("LDA", "Ordinal Logistic Regression"),
  Accuracy = c(round(accuracy, 4), round(acc_olr, 4)),
  APER     = c(round(aper, 4),     round(aper_olr, 4))
)

library(kableExtra)
kable(model_comp, caption = "Comparison of LDA and OLR Models") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE)

Comparison of LDA and OLR Models
Model	Accuracy	APER
LDA	0.5478	0.4522
Ordinal Logistic Regression	0.5391	0.4609

The comparison table shows that the LDA model achieves slightly higher accuracy than the OLR model. However, both models have very similar performance, with accuracy around 54% and relatively high error rates. While LDA performs marginally better, the OLR model is more appropriate for this dataset because it accounts for the ordinal nature of the response variable and provides richer probabilistic interpretation through odds ratios. Overall, the results suggest that there is considerable overlap between classes, which limits the classification performance of both models.

Prepared as part of the Multivariate Analysis and Generalized Linear Models Practicum – Data Science Program, Universitas Negeri Surabaya

Assumption Testing, Discriminant Analysis, and Ordinal Logistic Regression – Body Performance Dataset

Halilatunnisa (130) & Faishal Muflih Irfanu Tsaqib (170)

2026-05-04