1 Introduction

This dataset is used to analyze obesity levels (NObeyesdad), which consist of seven ordinal categories: Insufficient Weight < Normal Weight < Overweight Level I < Overweight Level II < Obesity Type I < Obesity Type II < Obesity Type III

Since the response variable has an inherent order (ordinal), the method used is Ordinal Logistic Regression, also known as the Proportional Odds Model.

In practice, this model requires several assumptions to be checked, including:

Linearity: the relationship between log-odds and continuous variables is linear
Independence: each observation is independent
No Multicollinearity: no high correlation among predictors
No Outliers: no extreme values in continuous variables

It is important to note that the linearity and outlier assumptions apply only to numerical variables. Therefore, this dataset includes:

Ordinal target variable: NObeyesdad (7 categories)
Continuous variables: Age, Height, Weight, FCVC, NCP, CH2O, FAF, TUE
Categorical variables: Gender, FAVC, CAEC, SMOKE, and others

2 Load Library and Data

pkgs <- c("MASS", "car", "ggplot2", "tidyverse", "corrplot", "gridExtra", "nnet")
for (p in pkgs) if (!require(p, character.only = TRUE)) install.packages(p)

library(MASS)
library(car)
library(ggplot2)
library(tidyverse)
library(corrplot)
library(gridExtra)
library(nnet)

fix_numeric <- function(x) {
  sapply(x, function(v) {
    v <- trimws(v)
    n_dots <- nchar(v) - nchar(gsub("\\.", "", v))
    if (n_dots > 1) {
      parts <- strsplit(v, "\\.")[[1]]
      v <- paste0(parts[1], ".", paste(parts[-1], collapse = ""))
    }
    suppressWarnings(as.numeric(v))
  })
}

df_raw <- read.csv("ObesityDataSet.csv",
                   sep = ";", stringsAsFactors = FALSE)

cont_vars <- c("Age", "Height", "Weight", "FCVC", "NCP", "CH2O", "FAF", "TUE")
cat_vars  <- c("Gender", "family_history_with_overweight", "FAVC",
               "CAEC", "SMOKE", "SCC", "CALC", "MTRANS")

df <- df_raw
for (v in cont_vars) df[[v]] <- fix_numeric(df_raw[[v]])
for (v in cat_vars)  df[[v]] <- as.factor(df[[v]])

df$NObeyesdad <- factor(df$NObeyesdad,
  levels = c("Insufficient_Weight", "Normal_Weight",
             "Overweight_Level_I",  "Overweight_Level_II",
             "Obesity_Type_I",      "Obesity_Type_II", "Obesity_Type_III"),
  ordered = TRUE)

cat("Data dimensions:", nrow(df), "rows x", ncol(df), "columns\n")

## Data dimensions: 2111 rows x 17 columns

cat("\nTarget class distribution:\n")

## 
## Target class distribution:

print(table(df$NObeyesdad))

## 
## Insufficient_Weight       Normal_Weight  Overweight_Level_I Overweight_Level_II 
##                 272                 287                 290                 290 
##      Obesity_Type_I     Obesity_Type_II    Obesity_Type_III 
##                 351                 297                 324

cat("\nMissing values per continuous variable:\n")

## 
## Missing values per continuous variable:

print(colSums(is.na(df[, cont_vars])))

##    Age Height Weight   FCVC    NCP   CH2O    FAF    TUE 
##      0      0      0      0      0      0      0      0

glimpse(df)

## Rows: 2,111
## Columns: 17
## $ Gender                         <fct> Female, Female, Male, Male, Male, Male,…
## $ Age                            <dbl> 21, 21, 23, 27, 22, 29, 23, 22, 24, 22,…
## $ Height                         <dbl> 1.62, 1.52, 1.80, 1.80, 1.78, 1.62, 1.5…
## $ Weight                         <dbl> 64.0, 56.0, 77.0, 87.0, 89.8, 53.0, 55.…
## $ family_history_with_overweight <fct> yes, yes, yes, no, no, no, yes, no, yes…
## $ FAVC                           <fct> no, no, no, no, no, yes, yes, no, yes, …
## $ FCVC                           <dbl> 2, 3, 2, 3, 2, 2, 3, 2, 3, 2, 3, 2, 3, …
## $ NCP                            <dbl> 3, 3, 3, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3, …
## $ CAEC                           <fct> Sometimes, Sometimes, Sometimes, Someti…
## $ SMOKE                          <fct> no, yes, no, no, no, no, no, no, no, no…
## $ CH2O                           <dbl> 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 3, …
## $ SCC                            <fct> no, yes, no, no, no, no, no, no, no, no…
## $ FAF                            <dbl> 0, 3, 2, 2, 0, 0, 1, 3, 1, 1, 2, 2, 2, …
## $ TUE                            <dbl> 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 2, 1, 0, …
## $ CALC                           <fct> no, Sometimes, Frequently, Frequently, …
## $ MTRANS                         <fct> Public_Transportation, Public_Transport…
## $ NObeyesdad                     <ord> Normal_Weight, Normal_Weight, Normal_We…

3 Assumption 1: Linearity

3.1 Concept

The linearity assumption requires an approximately linear relationship between the log-odds and the continuous predictor variables. To assess this, the Box-Tidwell test is used by adding an interaction term of the form X * ln(X). If this interaction term is not significant (p > 0.05), the relationship can be considered linear.

3.2 Data Transformation

For variables such as Age, Weight, and Height, the relationship with log-odds is often not directly linear. Therefore, logarithmic transformation is applied to help stabilize variance and reduce the influence of extreme values.

The transformations used are:

For variables that are always positive → log(x)
For variables that may contain zero values (e.g., FAF and TUE) → log(x + 1)

This approach is expected to make the relationship between continuous variables and log-odds closer to linear.

3.3 Log Transformation for Continuous Variables

vars_log1p <- c("FAF", "TUE") 
vars_log <- setdiff(cont_vars, vars_log1p)

df_t <- df

# Apply transformation
for (v in vars_log) {
df_t[[v]] <- log(df_t[[v]])
}

for (v in vars_log1p) {
df_t[[v]] <- log1p(df_t[[v]])
}

# Summary of transformation results
cat("Transformation completed\n")

## Transformation completed

cat("Variables log(x) :", paste(vars_log, collapse = ", "), "\n")

## Variables log(x) : Age, Height, Weight, FCVC, NCP, CH2O

cat("Variables log(x+1) :", paste(vars_log1p, collapse = ", "), "\n\n")

## Variables log(x+1) : FAF, TUE

# Check for invalid values
check <- sapply(cont_vars, function(v) sum(!is.finite(df_t[[v]])))
cat("Number of invalid values (should be 0):\n")

## Number of invalid values (should be 0):

print(check)

##    Age Height Weight   FCVC    NCP   CH2O    FAF    TUE 
##      0      0      0      0      0      0      0      0

3.4 Box-Tidwell Test

df_bt <- df_t[complete.cases(df_t[, cont_vars]), ]

# Create interaction terms X * ln(X)

for (v in cont_vars) {
x <- df_bt[[v]]

# Avoid values <= 0 to ensure log is defined

if (any(x <= 0, na.rm = TRUE)) {
x <- x - min(x, na.rm = TRUE) + 0.001
}

df_bt[[paste0(v, "_ln")]] <- x * log(x)
}

# Convert target to unordered (requirement for Box-Tidwell)

df_bt$target_unord <- relevel(
factor(as.character(df_bt$NObeyesdad)),
ref = "Normal_Weight"
)

# Define model formula

bt_formula <- as.formula(paste(
"target_unord ~",
paste(cont_vars, collapse = " + "), "+",
paste(paste0(cont_vars, "_ln"), collapse = " + ")
))

set.seed(42)
model_bt <- multinom(bt_formula, data = df_bt, trace = FALSE, maxit = 500)

# Compute p-values

z_bt <- summary(model_bt)$coefficients / summary(model_bt)$standard.errors
p_bt <- 2 * (1 - pnorm(abs(z_bt)))

ln_terms <- paste0(cont_vars, "_ln")
p_ln <- apply(p_bt[, ln_terms, drop = FALSE], 2, min)

# Summary of results

bt_result <- data.frame(
Variable = cont_vars,
Transformation = c("log(Age)", "log(Height)", "log(Weight)",
"log(FCVC)", "log(NCP)", "log(CH2O)",
"log1p(FAF)", "log1p(TUE)"),
p_value = round(p_ln, 4),
Interpretation = ifelse(p_ln > 0.05, "Satisfied", "Not satisfied")
)

rownames(bt_result) <- NULL
knitr::kable(bt_result, caption = "Box-Tidwell Test Results")

Box-Tidwell Test Results
Variable	Transformation	p_value	Interpretation
Age	log(Age)	0.1531	Satisfied
Height	log(Height)	0.0008	Not satisfied
Weight	log(Weight)	0.0000	Not satisfied
FCVC	log(FCVC)	0.0000	Not satisfied
NCP	log(NCP)	0.0000	Not satisfied
CH2O	log(CH2O)	0.0389	Not satisfied
FAF	log1p(FAF)	0.0000	Not satisfied
TUE	log1p(TUE)	0.0000	Not satisfied

Interpretation: The Box-Tidwell test results indicate that not all variables satisfy the linearity assumption (p > 0.05). This suggests that the relationship between some continuous variables and the log-odds is not fully linear, even after applying log transformation. This condition is common in real-world data. Therefore, the model can still be used.

4 Assumption 2: Independence

n_obs <- nrow(df_t)
n_dup <- sum(duplicated(df_t[, c(cont_vars, cat_vars)]))

cat("Number of observations :", n_obs, "\n")

## Number of observations : 2111

cat("Number of duplicates   :", n_dup, "\n\n")

## Number of duplicates   : 24

# Removing duplicates
df_t_clean <- df_t[!duplicated(df_t[, c(cont_vars, cat_vars)]), ]

cat("Number of observations After removing duplicates :", nrow(df_t_clean), "\n\n")

## Number of observations After removing duplicates : 2087

Interpretation: The dataset is cross-sectional, where each observation represents a different individual. A small number of duplicate records were identified and subsequently removed. After cleaning, no repeated observations remain, so the independence assumption can be considered satisfied.

5 Assumption 3: No Multicollinearity

Multicollinearity occurs when some predictor variables are highly correlated with each other, which can lead to instability in the estimation of model coefficients.

To detect this issue, two approaches are used:

Correlation matrix to examine relationships among continuous variables
Variance Inflation Factor (VIF) to measure the level of collinearity

As a rule of thumb, a VIF value above 10 indicates a serious problem, while values below 5 are generally considered acceptable.

5.1 Correlation Matrix

cor_mat <- cor(df_t[, cont_vars], use = "complete.obs")

corrplot(cor_mat,
         method = "color",
         type   = "upper",
         addCoef.col = "black",
         number.cex  = 0.7,
         tl.cex      = 0.8,
         col = colorRampPalette(c("#2166AC", "white", "#D6604D"))(200),
         title = "Correlation Matrix (After Log Transformation)",
         mar = c(0, 0, 2, 0))

5.2 VIF

df_vif <- df_t
df_vif$NObeyesdad_num <- as.numeric(df_t$NObeyesdad)

vif_formula <- as.formula(paste(
"NObeyesdad_num ~",
paste(cont_vars, collapse = " + "),
"+ Gender + family_history_with_overweight + FAVC +",
"CAEC + SMOKE + SCC + CALC + MTRANS"
))

lm_vif  <- lm(vif_formula, data = df_vif)
vif_val <- vif(lm_vif)

vif_df <- data.frame(
Variable = names(vif_val[, 1]),
VIF      = round(vif_val[, 1], 3),
Interpretation = ifelse(vif_val[, 1] < 5,  "Safe",
ifelse(vif_val[, 1] < 10, "Moderate concern", "High"))
)

rownames(vif_df) <- NULL
knitr::kable(vif_df, caption = "VIF Values of Variables")

VIF Values of Variables
Variable	VIF	Interpretation
Age	1.061	Safe
Height	1.020	Safe
Weight	1.075	Safe
FCVC	1.020	Safe
NCP	1.030	Safe
CH2O	1.026	Safe
FAF	1.030	Safe
TUE	1.044	Safe
Gender	1.098	Safe
family_history_with_overweight	1.249	Safe
FAVC	1.163	Safe
CAEC	1.311	Safe
SMOKE	1.030	Safe
SCC	1.090	Safe
CALC	1.142	Safe
MTRANS	1.227	Safe

Interpretation: Based on the VIF results, all variables have values below 5. This indicates that there is no high correlation among predictors that could affect the stability of the model. Therefore, the assumption of no multicollinearity can be considered satisfied.

6 Assumption 4: No Outliers

Logistic regression models are quite sensitive to the presence of outliers, especially in continuous variables. Therefore, it is important to handle extreme values to prevent distortion in the model results.

The approaches used in this analysis are:

Log transformation to reduce the influence of extreme values
Capping (winsorizing) based on IQR thresholds to limit outlier values

6.1 Outlier Detection Using IQR Method

detect_outliers <- function(x, var_name) {
Q1 <- quantile(x, 0.25, na.rm = TRUE)
Q3 <- quantile(x, 0.75, na.rm = TRUE)
IQR_val <- Q3 - Q1

lower <- Q1 - 1.5 * IQR_val
upper <- Q3 + 1.5 * IQR_val

n_out <- sum(x < lower | x > upper, na.rm = TRUE)

data.frame(
Variable = var_name,
Lower = round(lower, 3),
Upper = round(upper, 3),
Outliers = n_out,
Percent = round(n_out / length(x) * 100, 2)
)
}

outlier_tbl <- do.call(rbind,
lapply(cont_vars, function(v) detect_outliers(df_t[[v]], v)))

rownames(outlier_tbl) <- NULL
knitr::kable(outlier_tbl, caption = "Outliers After Log Transformation")

Outliers After Log Transformation
Variable	Lower	Upper	Outliers	Percent
Age	2.490	3.719	155	7.34
Height	0.356	0.715	157	7.44
Weight	3.254	5.494	138	6.54
FCVC	0.085	1.707	139	6.58
NCP	0.894	1.221	670	31.74
CH2O	-0.216	1.670	109	5.16
FAF	-1.251	2.398	76	3.60
TUE	-1.040	1.733	35	1.66

Interpretation: After the log transformation, the number of outliers in each variable has decreased compared to the original data. However, there are still some extreme values that need to be handled further.

6.2 Boxplot Before Capping

plots_before <- lapply(cont_vars, function(v) {
ggplot(df_t, aes_string(y = v)) +
geom_boxplot(fill = "#4292C6",
outlier.color = "red",
outlier.size = 1.2) +
labs(title = v, x = NULL, y = NULL) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, size = 10))
})

grid.arrange(grobs = plots_before, ncol = 4,
top = "Boxplot Before Capping (After Log Transformation)")

6.3 Outlier Capping Using IQR Method

cap_outlier <- function(x) {
Q1 <- quantile(x, 0.25, na.rm = TRUE)
Q3 <- quantile(x, 0.75, na.rm = TRUE)
IQR_val <- Q3 - Q1

lower <- Q1 - 1.5 * IQR_val
upper <- Q3 + 1.5 * IQR_val

x[x < lower] <- lower
x[x > upper] <- upper

x
}

df_clean <- df_t_clean
for (v in cont_vars) {
df_clean[[v]] <- cap_outlier(df_clean[[v]])
}


## Comparison of outliers before and after capping

for (v in cont_vars) {
Q1 <- quantile(df_t[[v]], 0.25)
Q3 <- quantile(df_t[[v]], 0.75)
n_before <- sum(df_t[[v]] < Q1 - 1.5*IQR(df_t[[v]]) |
df_t[[v]] > Q3 + 1.5*IQR(df_t[[v]]))

Q1c <- quantile(df_clean[[v]], 0.25)
Q3c <- quantile(df_clean[[v]], 0.75)
n_after <- sum(df_clean[[v]] < Q1c - 1.5*IQR(df_clean[[v]]) |
df_clean[[v]] > Q3c + 1.5*IQR(df_clean[[v]]))

cat(sprintf("%-10s : %3d → %d\n", v, n_before, n_after))
}

## Age        : 155 → 0
## Height     : 157 → 0
## Weight     : 138 → 0
## FCVC       : 139 → 0
## NCP        : 670 → 0
## CH2O       : 109 → 0
## FAF        :  76 → 0
## TUE        :  35 → 0

6.4 Boxplot After Capping

plots_after <- lapply(cont_vars, function(v) {
ggplot(df_clean, aes_string(y = v)) +
geom_boxplot(fill = "#74C476",
outlier.color = "red",
outlier.size = 1.2) +
labs(title = v, x = NULL, y = NULL) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, size = 10))
})

grid.arrange(grobs = plots_after, ncol = 4,
top = "Boxplot After Capping")

Interpretation: After applying capping, extreme values have been successfully limited within the IQR range. This reduces the influence of outliers and makes the data distribution more stable. Therefore, the impact of outliers has been effectively handled.

7 Summary of Assumption Testing

summary_assumptions <- data.frame(
No     = 1:4,
Assumption = c("Linearity", "Independence", "No Multicollinearity", "No Outliers"),
Method = c("Box-Tidwell",
"Data structure check",
"Correlation & VIF",
"IQR + Winsorizing"),
Treatment = c("Log/log1p transformation applied to approximate linearity",
"No special treatment",
"No special treatment",
"Log transformation followed by capping"),
Status = c("Partially satisfied",
"Satisfied",
"Satisfied",
"Satisfied after treatment")
)

knitr::kable(summary_assumptions,
caption = "Summary of Assumption Testing for Ordinal Logistic Regression Model")

Summary of Assumption Testing for Ordinal Logistic Regression Model
No	Assumption	Method	Treatment	Status
1	Linearity	Box-Tidwell	Log/log1p transformation applied to approximate linearity	Partially satisfied
2	Independence	Data structure check	No special treatment	Satisfied
3	No Multicollinearity	Correlation & VIF	No special treatment	Satisfied
4	No Outliers	IQR + Winsorizing	Log transformation followed by capping	Satisfied after treatment

cat(
"The preprocessed dataset contains", nrow(df_clean), "observations and",
ncol(df_clean), "variables.\n\n",

"The distribution of obesity level categories shows that all classes remain well represented:\n"
)

## The preprocessed dataset contains 2087 observations and 17 variables.
## 
##  The distribution of obesity level categories shows that all classes remain well represented:

print(table(df_clean$NObeyesdad))

## 
## Insufficient_Weight       Normal_Weight  Overweight_Level_I Overweight_Level_II 
##                 267                 282                 276                 290 
##      Obesity_Type_I     Obesity_Type_II    Obesity_Type_III 
##                 351                 297                 324

All continuous variables have undergone log (or log1p) transformation to improve relationship patterns and outlier handling using the IQR-based capping method, resulting in a more stable dataset that is ready for Ordinal Logistic Regression modeling.

Prepared for the Generalized Linear Models practicum – Data Science, UNESA

Assumption Testing of Ordinal Logistic Regression – Obesity Dataset

Halilatunnisa (130) & Faishal Muflih Irfanu Tsaqib (170)

2026-04-25