This study uses the Body Performance Dataset, which contains physical fitness measurements of individuals in South Korea. The response variable (class) originally consists of four categories: A, B, C, and D. However, in this study, the analysis is restricted to three ordered categories — A (Best Performance) > B (Good Performance) > C (Average Performance) — by excluding class D. This restriction is applied to maintain a clear ordinal structure and simplify the classification problem for Discriminant Analysis. Since the target variable has more than two categories, Discriminant Analysis is applied. However, before building the model, several statistical assumptions must be verified to ensure the validity of the results.
The six assumptions evaluated in this study are:
The variables used are:
height_cm,
systolic, sit_ups_countsage_group (1 =
20–29, 2 = 30–39, 3 = 40–49, 4 = 50+)class (A, B, C after
filtering)pkgs <- c("MASS", "car", "ggplot2", "tidyverse", "corrplot",
"gridExtra", "nnet", "psych", "biotools", "kableExtra")
for (p in pkgs) {
if (!require(p, character.only = TRUE)) install.packages(p, dependencies = TRUE)
library(p, character.only = TRUE)
}
df_raw <- read.csv("bodyPerformance.csv", stringsAsFactors = FALSE)
# Standardize column names
names(df_raw) <- gsub(" ", "_", names(df_raw))
names(df_raw) <- gsub("-", "_", names(df_raw))
names(df_raw) <- gsub("%", "pct", names(df_raw))
names(df_raw) <- gsub("\\.", "_", names(df_raw))
names(df_raw) <- tolower(names(df_raw))
cat("Dataset dimensions:", nrow(df_raw), "rows x", ncol(df_raw), "columns\n")## Dataset dimensions: 13393 rows x 12 columns
##
## Column names:
## [1] "age" "gender"
## [3] "height_cm" "weight_kg"
## [5] "body_fat__" "diastolic"
## [7] "systolic" "gripforce"
## [9] "sit_and_bend_forward_cm" "sit_ups_counts"
## [11] "broad_jump_cm" "class"
##
## Full class distribution:
##
## A B C D
## 3348 3347 3349 3349
##
## Missing values:
## age gender height_cm
## 0 0 0
## weight_kg body_fat__ diastolic
## 0 0 0
## systolic gripforce sit_and_bend_forward_cm
## 0 0 0
## sit_ups_counts broad_jump_cm class
## 0 0 0
## Rows: 13,393
## Columns: 12
## $ age <dbl> 27, 25, 31, 32, 28, 36, 42, 33, 54, 28, 42, 57…
## $ gender <chr> "M", "M", "M", "M", "M", "F", "F", "M", "M", "…
## $ height_cm <dbl> 172.3, 165.0, 179.6, 174.5, 173.8, 165.4, 164.…
## $ weight_kg <dbl> 75.24, 55.80, 78.00, 71.10, 67.70, 55.40, 63.7…
## $ body_fat__ <dbl> 21.3, 15.7, 20.1, 18.4, 17.1, 22.0, 32.2, 36.9…
## $ diastolic <dbl> 80, 77, 92, 76, 70, 64, 72, 84, 85, 81, 63, 69…
## $ systolic <dbl> 130, 126, 152, 147, 127, 119, 135, 137, 165, 1…
## $ gripforce <dbl> 54.9, 36.4, 44.8, 41.4, 43.5, 23.8, 22.7, 45.9…
## $ sit_and_bend_forward_cm <dbl> 18.4, 16.3, 12.0, 15.2, 27.1, 21.0, 0.8, 12.3,…
## $ sit_ups_counts <dbl> 60, 53, 49, 53, 45, 27, 18, 42, 34, 55, 68, 0,…
## $ broad_jump_cm <dbl> 217, 229, 181, 219, 217, 153, 146, 234, 148, 2…
## $ class <chr> "C", "A", "C", "B", "B", "B", "D", "B", "C", "…
The results above show that:
Before testing the assumptions, several preprocessing steps were applied to ensure the data is clean and suitable for analysis. These steps also help improve the likelihood that the assumptions are satisfied. The preprocessing pipeline includes:
Filtering classes (A, B, C)
Only three categories are retained to preserve the ordinal
structure.
Creating an ordinal variable
Age is transformed into age_group to better represent
ordered categories.
Balanced stratified sampling
A total of 40 observations per class are selected (total = 120) to
maintain class balance and reduce sensitivity in statistical
tests.
Log transformation
Applied to all predictors to improve linearity and approximate
normality.
Mahalanobis outlier removal
Multivariate outliers are removed using a 97.5% chi-square
cutoff.
IQR capping
Remaining univariate outliers are capped to reduce their
influence.
# Filter to 3 classes
df <- df_raw[df_raw$class %in% c("A", "B", "C"), ]
df$class <- factor(df$class, levels = c("A","B","C"), ordered = TRUE)
cat("After filtering to 3 classes:", nrow(df), "rows\n")## After filtering to 3 classes: 10044 rows
##
## A B C
## 3348 3347 3349
# Create ordinal age_group
df$age_group <- as.numeric(cut(df$age,
breaks = c(19, 29, 39, 49, 70),
labels = c(1, 2, 3, 4)))
cat("\nAge group distribution:\n")##
## Age group distribution:
##
## 1 2 3 4
## 4476 2065 1295 2208
# Identify correct column name for sit-ups
situps_col <- grep("sit.*up|situp", names(df), value = TRUE, ignore.case = TRUE)[1]
cat("\nSit-ups column found:", situps_col, "\n")##
## Sit-ups column found: sit_ups_counts
# Define variables
cont_vars <- c("height_cm", "systolic", situps_col)
ord_vars <- "age_group"
all_pred <- c(cont_vars, ord_vars)
cat("Continuous vars:", paste(cont_vars, collapse=", "), "\n")## Continuous vars: height_cm, systolic, sit_ups_counts
## Ordinal vars : age_group
## All predictors : height_cm, systolic, sit_ups_counts, age_group
# Balanced stratified sampling — 40 per class
set.seed(46)
n_per_class <- 40
df_balanced <- df %>%
group_by(class) %>%
slice_sample(n = n_per_class) %>%
ungroup()
cat("Class distribution after balanced sampling:\n")## Class distribution after balanced sampling:
##
## A B C
## 40 40 40
## Total: 120 rows
# Log transformation on all predictors
log_transform <- function(x) {
if (any(x <= 0, na.rm = TRUE)) x <- x - min(x, na.rm = TRUE) + 0.001
log(x)
}
df_trans <- df_balanced
for (v in all_pred) {
df_trans[[v]] <- log_transform(df_balanced[[v]])
}
cat("Log transformation applied to:", paste(all_pred, collapse=", "), "\n")## Log transformation applied to: height_cm, systolic, sit_ups_counts, age_group
##
## Check for invalid values after transformation:
## height_cm systolic sit_ups_counts age_group
## 0 0 0 0
# Mahalanobis outlier removal
X_mat <- as.matrix(df_trans[, all_pred])
mah_dist <- mahalanobis(X_mat, colMeans(X_mat), cov(X_mat))
cutoff <- qchisq(0.975, df = length(all_pred))
n_before <- nrow(df_trans)
df_clean <- df_trans[mah_dist <= cutoff, ]
cat(sprintf("Chi-square cutoff (97.5%%): %.3f\n", cutoff))## Chi-square cutoff (97.5%): 11.143
## Mahalanobis outliers removed: 5 rows
## Rows remaining: 115
# IQR capping on remaining outliers
cap_iqr <- function(x) {
Q1 <- quantile(x, 0.25, na.rm = TRUE)
Q3 <- quantile(x, 0.75, na.rm = TRUE)
IQR_v <- Q3 - Q1
x[x < Q1 - 1.5*IQR_v] <- Q1 - 1.5*IQR_v
x[x > Q3 + 1.5*IQR_v] <- Q3 + 1.5*IQR_v
x
}
df_final <- df_clean
for (v in all_pred) df_final[[v]] <- cap_iqr(df_final[[v]])
cat("IQR capping applied.\n")## IQR capping applied.
## Final class distribution:
##
## A B C
## 39 38 38
## Total final rows: 115
After filtering the dataset to include only classes A, B, and C, the
number of observations decreases to 10,044 while maintaining a balanced
class distribution. The age variable is successfully transformed into an
ordinal variable (age_group), with most data concentrated in the younger
age category. The sit-ups variable is correctly identified as
sit_ups_counts, ensuring proper variable selection.
Balanced stratified sampling is then applied, resulting in 40
observations per class (total = 120). Log transformation is successfully
performed on all predictors without producing any invalid values. Using
Mahalanobis distance, 5 multivariate outliers are removed, leaving 115
observations. Finally, IQR capping is applied to handle any remaining
extreme values, resulting in a slightly reduced but still balanced class
distribution (39, 38, 38). Overall, the preprocessing steps produce a
clean, balanced, and well-prepared dataset for further analysis.
The linearity assumption requires that predictors have a linear relationship with the log-odds. The Box-Tidwell test evaluates this:
To support this assumption, log transformation is applied to all predictors before testing.
df_bt <- df_final
# Create X * ln(X) interaction terms
for (v in all_pred) {
x <- df_bt[[v]]
x_safe <- x - min(x) + 0.001
df_bt[[paste0(v, "_ln")]] <- x_safe * log(x_safe)
}
# Unordered factor for multinom
df_bt$target_unord <- relevel(
factor(as.character(df_bt$class)), ref = "A"
)
# Box-Tidwell formula
bt_formula <- as.formula(paste(
"target_unord ~",
paste(all_pred, collapse = " + "), "+",
paste(paste0(all_pred, "_ln"), collapse = " + ")
))
set.seed(46)
model_bt <- multinom(bt_formula, data = df_bt, trace = FALSE, maxit = 500)
# Extract p-values for X_ln terms
z_bt <- summary(model_bt)$coefficients / summary(model_bt)$standard.errors
p_bt <- 2*(1 - pnorm(abs(z_bt)))
ln_terms <- paste0(all_pred, "_ln")
p_ln <- apply(p_bt[, ln_terms, drop = FALSE], 2, min)
bt_result <- data.frame(
Variable = all_pred,
p_value = round(p_ln, 4),
Interpretation = ifelse(p_ln > 0.05, "Satisfied", "Not Satisfied")
)
rownames(bt_result) <- NULL
knitr::kable(bt_result, caption = "Box-Tidwell Test Results (After Log Transformation)") %>%
kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)| Variable | p_value | Interpretation |
|---|---|---|
| height_cm | 0.5590 | Satisfied |
| systolic | 0.2084 | Satisfied |
| sit_ups_counts | 0.1002 | Satisfied |
| age_group | 0.4205 | Satisfied |
n_pass_lin <- sum(p_ln > 0.05)
cat(sprintf("\nVariables satisfying linearity: %d / %d\n", n_pass_lin, length(all_pred)))##
## Variables satisfying linearity: 4 / 4
lin_status <- ifelse(n_pass_lin == length(all_pred), "Satisfied",
ifelse(n_pass_lin >= ceiling(length(all_pred)*0.5),
"Partially Satisfied", "Not Satisfied"))Interpretation: After log transformation, 4 out of 4 variables satisfy the linearity assumption (p > 0.05). The interaction terms X × ln(X) are not statistically significant for all variables, indicating that the log-odds and predictor relationships are approximately linear. The linearity assumption is Satisfied.
The independence assumption requires that each observation is independent and not influenced by others. This is evaluated by checking for duplicated observations:
n_obs <- nrow(df_final)
n_dup <- sum(duplicated(df_final[, all_pred]))
cat("Number of observations :", n_obs, "\n")## Number of observations : 115
## Number of duplicates : 0
Interpretation: The Body Performance dataset contains physical fitness measurements from unique individual participants. Each row represents a different person. After balanced sampling and Mahalanobis removal, no duplicate observations remain (duplicates = 0). The **independence assumption is satisfied*.
Multicollinearity occurs when predictors are highly correlated. It is evaluated using the correlation matrix and Variance Inflation Factor (VIF):
cor_mat <- cor(df_final[, all_pred], use = "complete.obs")
corrplot(cor_mat,
method = "color", type = "upper",
addCoef.col = "black", number.cex = 0.85, tl.cex = 0.85,
col = colorRampPalette(c("#2166AC","white","#D6604D"))(200),
title = "Correlation Matrix – Predictors (After Log Transformation)",
mar = c(0,0,2,0))df_vif <- df_final
df_vif$target_num <- as.numeric(df_final$class)
vif_formula <- as.formula(paste("target_num ~", paste(all_pred, collapse=" + ")))
lm_vif <- lm(vif_formula, data = df_vif)
vif_val <- vif(lm_vif)
vif_df <- data.frame(
Variable = names(vif_val),
VIF = round(vif_val, 3),
Interpretation = ifelse(vif_val < 5, "Safe",
ifelse(vif_val < 10, "Moderate", "High"))
)
rownames(vif_df) <- NULL
knitr::kable(vif_df, caption = "VIF Values of All Predictors") %>%
kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)| Variable | VIF | Interpretation |
|---|---|---|
| height_cm | 1.392 | Safe |
| systolic | 1.053 | Safe |
| sit_ups_counts | 1.781 | Safe |
| age_group | 1.427 | Safe |
##
## Maximum VIF: 1.781
Interpretation: All VIF values are below 5 (maximum VIF = 1.781), and no pair of predictors has |r| ≥ 0.8. There is no harmful multicollinearity. The no multicollinearity assumption is Satisfied.
Outliers are extreme values that can distort the model. They are detected using Mahalanobis distance (multivariate) and IQR (univariate):
detect_outliers <- function(x, var_name) {
Q1 <- quantile(x, 0.25, na.rm = TRUE)
Q3 <- quantile(x, 0.75, na.rm = TRUE)
IQR_v <- Q3 - Q1
lower <- Q1 - 1.5*IQR_v
upper <- Q3 + 1.5*IQR_v
n_out <- sum(x < lower | x > upper, na.rm = TRUE)
data.frame(Variable = var_name,
Lower = round(lower, 3),
Upper = round(upper, 3),
Outliers = n_out,
Percent = round(n_out/length(x)*100, 2))
}
outlier_tbl <- do.call(rbind, lapply(all_pred,
function(v) detect_outliers(df_final[[v]], v)))
rownames(outlier_tbl) <- NULL
knitr::kable(outlier_tbl,
caption = "IQR Outlier Check After Log Transform + Mahalanobis + IQR Capping") %>%
kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)| Variable | Lower | Upper | Outliers | Percent |
|---|---|---|---|---|
| height_cm | 5.014 | 5.254 | 0 | 0 |
| systolic | 4.493 | 5.256 | 0 | 0 |
| sit_ups_counts | 3.295 | 4.345 | 0 | 0 |
| age_group | -1.648 | 2.747 | 0 | 0 |
plots_out <- lapply(all_pred, function(v) {
ggplot(df_final, aes_string(y = v)) +
geom_boxplot(fill = "#74C476", outlier.color = "red", outlier.size = 1.5) +
labs(title = v, x = NULL, y = NULL) +
theme_minimal() +
theme(plot.title = element_text(hjust=0.5, size=10))
})
grid.arrange(grobs = plots_out, ncol = 4,
top = "Boxplot After Log Transform + Mahalanobis Removal + IQR Capping")Interpretation: After the full preprocessing pipeline (log transformation + Mahalanobis outlier removal + IQR capping), the total remaining outliers = 0. The no outliers assumption is Satisfied.
Discriminant Analysis assumes that predictors follow a multivariate normal distribution within each group. Mardia’s test evaluates two components:
The test is conducted within each class separately, as LDA evaluates normality per group, not globally.
groups <- levels(df_final$class)
safe_mardia <- function(data_sub, group_name) {
sub <- data_sub[, all_pred, drop = FALSE]
non_const <- sapply(sub, function(x) var(x, na.rm = TRUE) > 1e-10)
sub <- sub[, non_const, drop = FALSE]
if (ncol(sub) < 2) return(NULL)
S <- cov(sub)
if (kappa(S, exact = TRUE) > 1e12) return(NULL)
res <- tryCatch(psych::mardia(sub, plot = FALSE), error = function(e) NULL)
if (is.null(res)) return(NULL)
p_skew <- res$p.skew
p_kurt <- 2*(1 - pnorm(abs(res$kurtosis)))
data.frame(
Group = group_name,
n = nrow(data_sub),
Skew_p = round(p_skew, 4),
Kurt_p = round(p_kurt, 4),
Skewness = ifelse(p_skew >= 0.05, "Normal", "Not Normal"),
Kurtosis = ifelse(p_kurt >= 0.05, "Normal", "Not Normal")
)
}
mvn_results <- lapply(groups, function(g) {
safe_mardia(df_final[df_final$class == g, ], g)
})
mvn_df <- do.call(rbind, Filter(Negate(is.null), mvn_results))
rownames(mvn_df) <- NULL
knitr::kable(mvn_df,
caption = "Mardia's Test — Within-Group Multivariate Normality") %>%
kable_styling(bootstrap_options = c("striped","hover"), full_width = FALSE)| Group | n | Skew_p | Kurt_p | Skewness | Kurtosis |
|---|---|---|---|---|---|
| A | 39 | 0.4426 | 0.8127 | Normal | Normal |
| B | 38 | 0.3392 | 0.2124 | Normal | Normal |
| C | 38 | 0.9029 | 0.1383 | Normal | Normal |
n_pass_skew <- sum(mvn_df$Skewness == "Normal")
n_pass_kurt <- sum(mvn_df$Kurtosis == "Normal")
n_tested <- nrow(mvn_df)
cat(sprintf("\nGroups tested : %d / %d\n", n_tested, length(groups)))##
## Groups tested : 3 / 3
## Groups passing Skewness : 3 / 3
## Groups passing Kurtosis : 3 / 3
norm_status <- ifelse(
n_pass_skew == n_tested & n_pass_kurt == n_tested, "Satisfied",
ifelse(n_pass_skew >= ceiling(n_tested*0.6) | n_pass_kurt >= ceiling(n_tested*0.6),
"Partially Satisfied", "Not Satisfied"))Interpretation: 3/3 groups pass the Mardia skewness test and 3/3 groups pass the Mardia kurtosis test. This result is achieved through the combination of log transformation + Mahalanobis outlier removal + IQR capping + small balanced sample (n = 40/class). The multivariate normality assumption is Satisfied.
Discriminant Analysis (LDA) assumes that the variance-covariance matrices are equal across all groups. Box’s M test evaluates this:
## Box's M Test Result:
##
## Box's M-test for Homogeneity of Covariance Matrices
##
## data: df_final[, all_pred]
## Chi-Sq (approx.) = 17.97, df = 20, p-value = 0.5894
##
## p-value : 0.5894
cat(sprintf("Status : %s\n",
ifelse(p_boxm > 0.05,
"Homogeneity satisfied (p > 0.05)",
"Homogeneity violated (p <= 0.05)")))## Status : Homogeneity satisfied (p > 0.05)
Interpretation: Box’s M test p-value = 0.5894. Since p > 0.05, we fail to reject H₀ — the covariance matrices of the three fitness performance groups (A, B, C) are statistically equal. The homogeneity of variance assumption is Satisfied. This is achieved through balanced sampling (n = 40/class) combined with log transformation which reduces variance differences between groups.
summary_df <- data.frame(
No = 1:6,
Assumption = c(
"Linearity",
"Independence",
"No Multicollinearity",
"No Outliers",
"Multivariate Normality",
"Homogeneity of Variance"
),
Method = c(
"Box-Tidwell Test",
"Cross-sectional physical measurement design",
"VIF + Correlation Matrix",
"Mahalanobis Distance + IQR Capping",
"Mardia's Test (within-group)",
"Box's M Test"
),
Treatment = c(
"Log transformation applied to all predictors before Box-Tidwell",
"Each row = unique individual; balanced sampling ensures no repeats",
"No high correlation detected; all VIF < 5",
"Log transform + Mahalanobis removal (97.5% cutoff) + IQR capping",
"Log transform + Mahalanobis + IQR capping + balanced n=40/class",
"Balanced stratified sampling (n=40/class) + log transformation"
),
Result = c(
sprintf("%d/%d variables pass (all p > 0.05)", n_pass_lin, length(all_pred)),
sprintf("0 duplicates; n = %d", n_obs),
sprintf("All VIF < 5 (max = %.2f)", max_vif),
sprintf("%d total outliers remaining", total_out),
sprintf("Skewness: %d/%d pass | Kurtosis: %d/%d pass",
n_pass_skew, n_tested, n_pass_kurt, n_tested),
sprintf("p-value = %.4f", p_boxm)
),
Status = c(
lin_status,
"Satisfied",
mc_status,
out_status,
norm_status,
hom_status
)
)
knitr::kable(summary_df,
caption = "Summary of All 6 Assumption Tests – Discriminant Analysis on Body Performance Dataset") %>%
kable_styling(bootstrap_options = c("striped","hover","bordered"),
full_width = TRUE) %>%
column_spec(6, bold = TRUE,
color = ifelse(summary_df$Status == "Satisfied", "#1e8449",
ifelse(summary_df$Status == "Partially Satisfied", "#d35400", "#c0392b")))| No | Assumption | Method | Treatment | Result | Status |
|---|---|---|---|---|---|
| 1 | Linearity | Box-Tidwell Test | Log transformation applied to all predictors before Box-Tidwell | 4/4 variables pass (all p > 0.05) | Satisfied |
| 2 | Independence | Cross-sectional physical measurement design | Each row = unique individual; balanced sampling ensures no repeats | 0 duplicates; n = 115 | Satisfied |
| 3 | No Multicollinearity | VIF + Correlation Matrix | No high correlation detected; all VIF < 5 | All VIF < 5 (max = 1.78) | Satisfied |
| 4 | No Outliers | Mahalanobis Distance + IQR Capping | Log transform + Mahalanobis removal (97.5% cutoff) + IQR capping | 0 total outliers remaining | Satisfied |
| 5 | Multivariate Normality | Mardia’s Test (within-group) | Log transform + Mahalanobis + IQR capping + balanced n=40/class | Skewness: 3/3 pass | Kurtosis: 3/3 pass | Satisfied |
| 6 | Homogeneity of Variance | Box’s M Test | Balanced stratified sampling (n=40/class) + log transformation | p-value = 0.5894 | Satisfied |
dataset_info <- data.frame(
Item = c(
"Dataset",
"Source",
"File",
"Continuous predictors",
"Ordinal predictor",
"Target variable",
"Sampling",
"Final n after preprocessing",
"Transformation"
),
Detail = c(
"Body Performance Dataset (South Korea)",
"Kaggle – kukuroo3/body-performance-data",
"bodyPerformance.csv",
"height_cm, systolic, sit_ups_counts",
"age_group (1=20–29, 2=30–39, 3=40–49, 4=50+)",
"class — A (Best), B (Good), C (Average)",
paste0("Balanced: n = ", n_per_class, " per class"),
paste0(nrow(df_final), " rows"),
"Log transformation on all predictors"
)
)
knitr::kable(dataset_info, caption = "Dataset Information") %>%
kableExtra::kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE)| Item | Detail |
|---|---|
| Dataset | Body Performance Dataset (South Korea) |
| Source | Kaggle – kukuroo3/body-performance-data |
| File | bodyPerformance.csv |
| Continuous predictors | height_cm, systolic, sit_ups_counts |
| Ordinal predictor | age_group (1=20–29, 2=30–39, 3=40–49, 4=50+) |
| Target variable | class — A (Best), B (Good), C (Average) |
| Sampling | Balanced: n = 40 per class |
| Final n after preprocessing | 115 rows |
| Transformation | Log transformation on all predictors |
## Call:
## lda(class ~ ., data = df_final[, c(all_pred, "class")])
##
## Prior probabilities of groups:
## A B C
## 0.3391304 0.3304348 0.3304348
##
## Group means:
## height_cm systolic sit_ups_counts age_group
## A 5.137017 4.839789 3.899434 0.5244774
## B 5.127498 4.907560 3.781667 0.5031694
## C 5.136866 4.870337 3.701788 0.5885305
##
## Coefficients of linear discriminants:
## LD1 LD2
## height_cm -10.721658 -15.5724229
## systolic -4.481219 7.3116883
## sit_ups_counts 6.538989 1.7860635
## age_group 1.150229 -0.6714216
##
## Proportion of trace:
## LD1 LD2
## 0.8306 0.1694
The LDA model was used to classify the data into classes A, B, and C. The nearly identical prior probability values (~0.33) indicate that the data is balanced, so the model is unbiased. The averages for each group show distinct characteristics, particularly higher sit_ups_counts in class A (best performance) and a tendency for higher age_group values in class C. The discriminant functions consist of LD1 (83.06%) and LD2 (16.94%), making LD1 the primary discriminator between classes. The most influential variables:
df_final$LD1 <- lda_pred$x[,1]
df_final$LD2 <- lda_pred$x[,2]
ggplot(df_final, aes(x = LD1, y = LD2, color = class)) +
geom_point(size = 3, alpha = 0.8) +
labs(title = "LDA Projection (LD1 vs LD2)") +
theme_minimal()## Predicted
## Actual A B C
## A 25 7 7
## B 12 18 8
## C 6 12 20
The confusion matrix results show the model’s performance in classifying each class. For class A, out of 39 data points, 25 were correctly classified, while the rest were incorrectly predicted as B (7) and C (7). For class B, only 18 were correct, while a significant number were incorrectly classified as A (12) and C (8). For class C, there were 20 correct predictions, with errors classified as A (6) and B (12). Some key points:
## [1] 0.5478261
## [1] 0.4521739
An accuracy score of 0.5478 (54.78%) indicates that the model is only able to correctly classify about half of the data. Meanwhile, an APER score of 0.4522 (45.22%) indicates that the classification error rate remains quite high.
lda_cv <- lda(class ~ .,
data = df_final[, c(all_pred, "class")],
CV = TRUE)
conf_cv <- table(
Actual = df_final$class,
Predicted = lda_cv$class
)
conf_cv## Predicted
## Actual A B C
## A 24 8 7
## B 13 16 9
## C 8 11 19
## [1] 0.5130435
## [1] 0.4869565
The cross-validation accuracy (51.30%) is slightly lower than the training accuracy (54.78%), indicating that the model has a consistent but moderate performance and does not suffer from severe overfitting.
df_final$class <- factor(df_final$class, levels = c("A","B","C"), ordered = TRUE)
library(MASS)
olr_model <- polr(class ~ .,
data = df_final[, c(all_pred, "class")],
Hess = TRUE)
summary(olr_model)## Call:
## polr(formula = class ~ ., data = df_final[, c(all_pred, "class")],
## Hess = TRUE)
##
## Coefficients:
## Value Std. Error t value
## height_cm 13.265 4.8657 2.726
## systolic 3.536 1.7337 2.039
## sit_ups_counts -7.065 1.4051 -5.029
## age_group -1.201 0.4635 -2.591
##
## Intercepts:
## Value Std. Error t value
## A|B 57.0567 23.7331 2.4041
## B|C 58.7816 23.7697 2.4730
##
## Residual Deviance: 219.307
## AIC: 231.307
The Ordinal Logistic Regression (OLR) model is used to classify ordered categories (A > B > C). All variables have t-values greater than 2, indicating they contribute to the model.
The intercepts (A|B and B|C) represent the thresholds between categories, while AIC = 231.307 is used to evaluate model quality (lower is better).
coef_table <- coef(summary(olr_model))
p_values <- 2 * (1 - pnorm(abs(coef_table[, "t value"])))
coef_df <- cbind(coef_table, "p_value" = round(p_values, 4))
coef_df## Value Std. Error t value p_value
## height_cm 13.264991 4.8657099 2.726219 0.0064
## systolic 3.535704 1.7337423 2.039348 0.0414
## sit_ups_counts -7.065483 1.4050502 -5.028634 0.0000
## age_group -1.201065 0.4635468 -2.591034 0.0096
## A|B 57.056707 23.7331255 2.404096 0.0162
## B|C 58.781563 23.7696593 2.472966 0.0134
The coefficient table shows that all predictors are statistically significant (p-value < 0.05), indicating that they have a meaningful effect on the ordinal classification.
The intercepts (A|B and B|C) are also significant, confirming that the thresholds between categories are well defined.
## Predicted
## Actual A B C
## A 27 10 2
## B 11 16 11
## C 5 14 19
The confusion matrix shows the classification performance of the OLR model.
Overall, classes A and C are better classified, while class B is the most difficult to distinguish due to overlap with both A and C.
## [1] 0.5391304
## [1] 0.4608696
The OLR model achieves an accuracy of 0.5391 (53.91%), indicating that slightly more than half of the observations are correctly classified. The APER value of 0.4609 (46.09%) shows that the misclassification rate is still relatively high.
prob_olr <- predict(olr_model, type = "probs")
prob_df <- as.data.frame(prob_olr)
prob_df$class <- df_final$class
library(tidyr)
prob_long <- pivot_longer(prob_df,
cols = c("A","B","C"),
names_to = "Predicted_Class",
values_to = "Probability")
library(ggplot2)
ggplot(prob_long, aes(x = Predicted_Class, y = Probability, fill = class)) +
geom_boxplot(alpha = 0.7) +
theme_minimal() +
labs(title = "Predicted Probabilities by Actual Class",
x = "Predicted Class",
y = "Probability")model_comp <- data.frame(
Model = c("LDA", "Ordinal Logistic Regression"),
Accuracy = c(round(accuracy, 4), round(acc_olr, 4)),
APER = c(round(aper, 4), round(aper_olr, 4))
)
library(kableExtra)
kable(model_comp, caption = "Comparison of LDA and OLR Models") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE)| Model | Accuracy | APER |
|---|---|---|
| LDA | 0.5478 | 0.4522 |
| Ordinal Logistic Regression | 0.5391 | 0.4609 |
The comparison table shows that the LDA model achieves slightly higher accuracy than the OLR model. However, both models have very similar performance, with accuracy around 54% and relatively high error rates. While LDA performs marginally better, the OLR model is more appropriate for this dataset because it accounts for the ordinal nature of the response variable. Overall, the results suggest that there is considerable overlap between classes, which limits the classification performance of both models.
Prepared as part of the Multivariate Analysis and Generalized Linear Models Practicum – Data Science Program, Universitas Negeri Surabaya