1 Introduction

This dataset is used to analyze obesity levels (NObeyesdad), which consist of seven ordinal categories: Insufficient Weight < Normal Weight < Overweight Level I < Overweight Level II < Obesity Type I < Obesity Type II < Obesity Type III

Since the response variable has an inherent order (ordinal), the method used is Ordinal Logistic Regression, also known as the Proportional Odds Model.

In practice, this model requires several assumptions to be checked, including:

  1. Linearity: the relationship between log-odds and continuous variables is linear
  2. Independence: each observation is independent
  3. No Multicollinearity: no high correlation among predictors
  4. No Outliers: no extreme values in continuous variables

It is important to note that the linearity and outlier assumptions apply only to numerical variables. Therefore, this dataset includes:

  • Ordinal target variable: NObeyesdad (7 categories)
  • Continuous variables: Age, Height, Weight, FCVC, NCP, CH2O, FAF, TUE
  • Categorical variables: Gender, FAVC, CAEC, SMOKE, and others

2 Load Library and Data

pkgs <- c("MASS", "car", "ggplot2", "tidyverse", "corrplot", "gridExtra", "nnet")
for (p in pkgs) if (!require(p, character.only = TRUE)) install.packages(p)

library(MASS)
library(car)
library(ggplot2)
library(tidyverse)
library(corrplot)
library(gridExtra)
library(nnet)

fix_numeric <- function(x) {
  sapply(x, function(v) {
    v <- trimws(v)
    n_dots <- nchar(v) - nchar(gsub("\\.", "", v))
    if (n_dots > 1) {
      parts <- strsplit(v, "\\.")[[1]]
      v <- paste0(parts[1], ".", paste(parts[-1], collapse = ""))
    }
    suppressWarnings(as.numeric(v))
  })
}

df_raw <- read.csv("ObesityDataSet.csv",
                   sep = ";", stringsAsFactors = FALSE)

cont_vars <- c("Age", "Height", "Weight", "FCVC", "NCP", "CH2O", "FAF", "TUE")
cat_vars  <- c("Gender", "family_history_with_overweight", "FAVC",
               "CAEC", "SMOKE", "SCC", "CALC", "MTRANS")

df <- df_raw
for (v in cont_vars) df[[v]] <- fix_numeric(df_raw[[v]])
for (v in cat_vars)  df[[v]] <- as.factor(df[[v]])

df$NObeyesdad <- factor(df$NObeyesdad,
  levels = c("Insufficient_Weight", "Normal_Weight",
             "Overweight_Level_I",  "Overweight_Level_II",
             "Obesity_Type_I",      "Obesity_Type_II", "Obesity_Type_III"),
  ordered = TRUE)

cat("Data dimensions:", nrow(df), "rows x", ncol(df), "columns\n")
## Data dimensions: 2111 rows x 17 columns
cat("\nTarget class distribution:\n")
## 
## Target class distribution:
print(table(df$NObeyesdad))
## 
## Insufficient_Weight       Normal_Weight  Overweight_Level_I Overweight_Level_II 
##                 272                 287                 290                 290 
##      Obesity_Type_I     Obesity_Type_II    Obesity_Type_III 
##                 351                 297                 324
cat("\nMissing values per continuous variable:\n")
## 
## Missing values per continuous variable:
print(colSums(is.na(df[, cont_vars])))
##    Age Height Weight   FCVC    NCP   CH2O    FAF    TUE 
##      0      0      0      0      0      0      0      0
glimpse(df)
## Rows: 2,111
## Columns: 17
## $ Gender                         <fct> Female, Female, Male, Male, Male, Male,…
## $ Age                            <dbl> 21, 21, 23, 27, 22, 29, 23, 22, 24, 22,…
## $ Height                         <dbl> 1.62, 1.52, 1.80, 1.80, 1.78, 1.62, 1.5…
## $ Weight                         <dbl> 64.0, 56.0, 77.0, 87.0, 89.8, 53.0, 55.…
## $ family_history_with_overweight <fct> yes, yes, yes, no, no, no, yes, no, yes…
## $ FAVC                           <fct> no, no, no, no, no, yes, yes, no, yes, …
## $ FCVC                           <dbl> 2, 3, 2, 3, 2, 2, 3, 2, 3, 2, 3, 2, 3, …
## $ NCP                            <dbl> 3, 3, 3, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3, …
## $ CAEC                           <fct> Sometimes, Sometimes, Sometimes, Someti…
## $ SMOKE                          <fct> no, yes, no, no, no, no, no, no, no, no…
## $ CH2O                           <dbl> 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 3, …
## $ SCC                            <fct> no, yes, no, no, no, no, no, no, no, no…
## $ FAF                            <dbl> 0, 3, 2, 2, 0, 0, 1, 3, 1, 1, 2, 2, 2, …
## $ TUE                            <dbl> 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 2, 1, 0, …
## $ CALC                           <fct> no, Sometimes, Frequently, Frequently, …
## $ MTRANS                         <fct> Public_Transportation, Public_Transport…
## $ NObeyesdad                     <ord> Normal_Weight, Normal_Weight, Normal_We…

3 Assumption 1: Linearity

3.1 Concept

The linearity assumption requires an approximately linear relationship between the log-odds and the continuous predictor variables. To assess this, the Box-Tidwell test is used by adding an interaction term of the form X * ln(X). If this interaction term is not significant (p > 0.05), the relationship can be considered linear.

3.2 Data Transformation

For variables such as Age, Weight, and Height, the relationship with log-odds is often not directly linear. Therefore, logarithmic transformation is applied to help stabilize variance and reduce the influence of extreme values.

The transformations used are:

  • For variables that are always positive → log(x)
  • For variables that may contain zero values (e.g., FAF and TUE) → log(x + 1)

This approach is expected to make the relationship between continuous variables and log-odds closer to linear.

3.3 Log Transformation for Continuous Variables

vars_log1p <- c("FAF", "TUE") 
vars_log <- setdiff(cont_vars, vars_log1p)

df_t <- df

# Apply transformation
for (v in vars_log) {
df_t[[v]] <- log(df_t[[v]])
}

for (v in vars_log1p) {
df_t[[v]] <- log1p(df_t[[v]])
}

# Summary of transformation results
cat("Transformation completed\n")
## Transformation completed
cat("Variables log(x) :", paste(vars_log, collapse = ", "), "\n")
## Variables log(x) : Age, Height, Weight, FCVC, NCP, CH2O
cat("Variables log(x+1) :", paste(vars_log1p, collapse = ", "), "\n\n")
## Variables log(x+1) : FAF, TUE
# Check for invalid values
check <- sapply(cont_vars, function(v) sum(!is.finite(df_t[[v]])))
cat("Number of invalid values (should be 0):\n")
## Number of invalid values (should be 0):
print(check)
##    Age Height Weight   FCVC    NCP   CH2O    FAF    TUE 
##      0      0      0      0      0      0      0      0

3.4 Box-Tidwell Test

df_bt <- df_t[complete.cases(df_t[, cont_vars]), ]

# Create interaction terms X * ln(X)

for (v in cont_vars) {
x <- df_bt[[v]]

# Avoid values <= 0 to ensure log is defined

if (any(x <= 0, na.rm = TRUE)) {
x <- x - min(x, na.rm = TRUE) + 0.001
}

df_bt[[paste0(v, "_ln")]] <- x * log(x)
}

# Convert target to unordered (requirement for Box-Tidwell)

df_bt$target_unord <- relevel(
factor(as.character(df_bt$NObeyesdad)),
ref = "Normal_Weight"
)

# Define model formula

bt_formula <- as.formula(paste(
"target_unord ~",
paste(cont_vars, collapse = " + "), "+",
paste(paste0(cont_vars, "_ln"), collapse = " + ")
))

set.seed(42)
model_bt <- multinom(bt_formula, data = df_bt, trace = FALSE, maxit = 500)

# Compute p-values

z_bt <- summary(model_bt)$coefficients / summary(model_bt)$standard.errors
p_bt <- 2 * (1 - pnorm(abs(z_bt)))

ln_terms <- paste0(cont_vars, "_ln")
p_ln <- apply(p_bt[, ln_terms, drop = FALSE], 2, min)

# Summary of results

bt_result <- data.frame(
Variable = cont_vars,
Transformation = c("log(Age)", "log(Height)", "log(Weight)",
"log(FCVC)", "log(NCP)", "log(CH2O)",
"log1p(FAF)", "log1p(TUE)"),
p_value = round(p_ln, 4),
Interpretation = ifelse(p_ln > 0.05, "Satisfied", "Not satisfied")
)

rownames(bt_result) <- NULL
knitr::kable(bt_result, caption = "Box-Tidwell Test Results")
Box-Tidwell Test Results
Variable Transformation p_value Interpretation
Age log(Age) 0.1531 Satisfied
Height log(Height) 0.0008 Not satisfied
Weight log(Weight) 0.0000 Not satisfied
FCVC log(FCVC) 0.0000 Not satisfied
NCP log(NCP) 0.0000 Not satisfied
CH2O log(CH2O) 0.0389 Not satisfied
FAF log1p(FAF) 0.0000 Not satisfied
TUE log1p(TUE) 0.0000 Not satisfied

Interpretation: The Box-Tidwell test results indicate that not all variables satisfy the linearity assumption (p > 0.05). This suggests that the relationship between some continuous variables and the log-odds is not fully linear, even after applying log transformation. This condition is common in real-world data. Therefore, the model can still be used.


4 Assumption 2: Independence

n_obs <- nrow(df_t)
n_dup <- sum(duplicated(df_t[, c(cont_vars, cat_vars)]))

cat("Number of observations :", n_obs, "\n")
## Number of observations : 2111
cat("Number of duplicates   :", n_dup, "\n\n")
## Number of duplicates   : 24
# Removing duplicates
df_t_clean <- df_t[!duplicated(df_t[, c(cont_vars, cat_vars)]), ]

cat("Number of observations After removing duplicates :", nrow(df_t_clean), "\n\n")
## Number of observations After removing duplicates : 2087

Interpretation: The dataset is cross-sectional, where each observation represents a different individual. A small number of duplicate records were identified and subsequently removed. After cleaning, no repeated observations remain, so the independence assumption can be considered satisfied.


5 Assumption 3: No Multicollinearity

Multicollinearity occurs when some predictor variables are highly correlated with each other, which can lead to instability in the estimation of model coefficients.

To detect this issue, two approaches are used:

  • Correlation matrix to examine relationships among continuous variables
  • Variance Inflation Factor (VIF) to measure the level of collinearity

As a rule of thumb, a VIF value above 10 indicates a serious problem, while values below 5 are generally considered acceptable.

5.1 Correlation Matrix

cor_mat <- cor(df_t[, cont_vars], use = "complete.obs")

corrplot(cor_mat,
         method = "color",
         type   = "upper",
         addCoef.col = "black",
         number.cex  = 0.7,
         tl.cex      = 0.8,
         col = colorRampPalette(c("#2166AC", "white", "#D6604D"))(200),
         title = "Correlation Matrix (After Log Transformation)",
         mar = c(0, 0, 2, 0))

5.2 VIF

df_vif <- df_t
df_vif$NObeyesdad_num <- as.numeric(df_t$NObeyesdad)

vif_formula <- as.formula(paste(
"NObeyesdad_num ~",
paste(cont_vars, collapse = " + "),
"+ Gender + family_history_with_overweight + FAVC +",
"CAEC + SMOKE + SCC + CALC + MTRANS"
))

lm_vif  <- lm(vif_formula, data = df_vif)
vif_val <- vif(lm_vif)

vif_df <- data.frame(
Variable = names(vif_val[, 1]),
VIF      = round(vif_val[, 1], 3),
Interpretation = ifelse(vif_val[, 1] < 5,  "Safe",
ifelse(vif_val[, 1] < 10, "Moderate concern", "High"))
)

rownames(vif_df) <- NULL
knitr::kable(vif_df, caption = "VIF Values of Variables")
VIF Values of Variables
Variable VIF Interpretation
Age 1.061 Safe
Height 1.020 Safe
Weight 1.075 Safe
FCVC 1.020 Safe
NCP 1.030 Safe
CH2O 1.026 Safe
FAF 1.030 Safe
TUE 1.044 Safe
Gender 1.098 Safe
family_history_with_overweight 1.249 Safe
FAVC 1.163 Safe
CAEC 1.311 Safe
SMOKE 1.030 Safe
SCC 1.090 Safe
CALC 1.142 Safe
MTRANS 1.227 Safe

Interpretation: Based on the VIF results, all variables have values below 5. This indicates that there is no high correlation among predictors that could affect the stability of the model. Therefore, the assumption of no multicollinearity can be considered satisfied.


6 Assumption 4: No Outliers

Logistic regression models are quite sensitive to the presence of outliers, especially in continuous variables. Therefore, it is important to handle extreme values to prevent distortion in the model results.

The approaches used in this analysis are:

  • Log transformation to reduce the influence of extreme values
  • Capping (winsorizing) based on IQR thresholds to limit outlier values

6.1 Outlier Detection Using IQR Method

detect_outliers <- function(x, var_name) {
Q1 <- quantile(x, 0.25, na.rm = TRUE)
Q3 <- quantile(x, 0.75, na.rm = TRUE)
IQR_val <- Q3 - Q1

lower <- Q1 - 1.5 * IQR_val
upper <- Q3 + 1.5 * IQR_val

n_out <- sum(x < lower | x > upper, na.rm = TRUE)

data.frame(
Variable = var_name,
Lower = round(lower, 3),
Upper = round(upper, 3),
Outliers = n_out,
Percent = round(n_out / length(x) * 100, 2)
)
}

outlier_tbl <- do.call(rbind,
lapply(cont_vars, function(v) detect_outliers(df_t[[v]], v)))

rownames(outlier_tbl) <- NULL
knitr::kable(outlier_tbl, caption = "Outliers After Log Transformation")
Outliers After Log Transformation
Variable Lower Upper Outliers Percent
Age 2.490 3.719 155 7.34
Height 0.356 0.715 157 7.44
Weight 3.254 5.494 138 6.54
FCVC 0.085 1.707 139 6.58
NCP 0.894 1.221 670 31.74
CH2O -0.216 1.670 109 5.16
FAF -1.251 2.398 76 3.60
TUE -1.040 1.733 35 1.66

Interpretation: After the log transformation, the number of outliers in each variable has decreased compared to the original data. However, there are still some extreme values that need to be handled further.

6.2 Boxplot Before Capping

plots_before <- lapply(cont_vars, function(v) {
ggplot(df_t, aes_string(y = v)) +
geom_boxplot(fill = "#4292C6",
outlier.color = "red",
outlier.size = 1.2) +
labs(title = v, x = NULL, y = NULL) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, size = 10))
})

grid.arrange(grobs = plots_before, ncol = 4,
top = "Boxplot Before Capping (After Log Transformation)")

6.3 Outlier Capping Using IQR Method

cap_outlier <- function(x) {
Q1 <- quantile(x, 0.25, na.rm = TRUE)
Q3 <- quantile(x, 0.75, na.rm = TRUE)
IQR_val <- Q3 - Q1

lower <- Q1 - 1.5 * IQR_val
upper <- Q3 + 1.5 * IQR_val

x[x < lower] <- lower
x[x > upper] <- upper

x
}

df_clean <- df_t_clean
for (v in cont_vars) {
df_clean[[v]] <- cap_outlier(df_clean[[v]])
}


## Comparison of outliers before and after capping

for (v in cont_vars) {
Q1 <- quantile(df_t[[v]], 0.25)
Q3 <- quantile(df_t[[v]], 0.75)
n_before <- sum(df_t[[v]] < Q1 - 1.5*IQR(df_t[[v]]) |
df_t[[v]] > Q3 + 1.5*IQR(df_t[[v]]))

Q1c <- quantile(df_clean[[v]], 0.25)
Q3c <- quantile(df_clean[[v]], 0.75)
n_after <- sum(df_clean[[v]] < Q1c - 1.5*IQR(df_clean[[v]]) |
df_clean[[v]] > Q3c + 1.5*IQR(df_clean[[v]]))

cat(sprintf("%-10s : %3d → %d\n", v, n_before, n_after))
}
## Age        : 155 → 0
## Height     : 157 → 0
## Weight     : 138 → 0
## FCVC       : 139 → 0
## NCP        : 670 → 0
## CH2O       : 109 → 0
## FAF        :  76 → 0
## TUE        :  35 → 0

6.4 Boxplot After Capping

plots_after <- lapply(cont_vars, function(v) {
ggplot(df_clean, aes_string(y = v)) +
geom_boxplot(fill = "#74C476",
outlier.color = "red",
outlier.size = 1.2) +
labs(title = v, x = NULL, y = NULL) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, size = 10))
})

grid.arrange(grobs = plots_after, ncol = 4,
top = "Boxplot After Capping")

Interpretation: After applying capping, extreme values have been successfully limited within the IQR range. This reduces the influence of outliers and makes the data distribution more stable. Therefore, the impact of outliers has been effectively handled.


7 Summary of Assumption Testing

summary_assumptions <- data.frame(
No     = 1:4,
Assumption = c("Linearity", "Independence", "No Multicollinearity", "No Outliers"),
Method = c("Box-Tidwell",
"Data structure check",
"Correlation & VIF",
"IQR + Winsorizing"),
Treatment = c("Log/log1p transformation applied to approximate linearity",
"No special treatment",
"No special treatment",
"Log transformation followed by capping"),
Status = c("Partially satisfied",
"Satisfied",
"Satisfied",
"Satisfied after treatment")
)

knitr::kable(summary_assumptions,
caption = "Summary of Assumption Testing for Ordinal Logistic Regression Model")
Summary of Assumption Testing for Ordinal Logistic Regression Model
No Assumption Method Treatment Status
1 Linearity Box-Tidwell Log/log1p transformation applied to approximate linearity Partially satisfied
2 Independence Data structure check No special treatment Satisfied
3 No Multicollinearity Correlation & VIF No special treatment Satisfied
4 No Outliers IQR + Winsorizing Log transformation followed by capping Satisfied after treatment

cat(
"The preprocessed dataset contains", nrow(df_clean), "observations and",
ncol(df_clean), "variables.\n\n",

"The distribution of obesity level categories shows that all classes remain well represented:\n"
)
## The preprocessed dataset contains 2087 observations and 17 variables.
## 
##  The distribution of obesity level categories shows that all classes remain well represented:
print(table(df_clean$NObeyesdad))
## 
## Insufficient_Weight       Normal_Weight  Overweight_Level_I Overweight_Level_II 
##                 267                 282                 276                 290 
##      Obesity_Type_I     Obesity_Type_II    Obesity_Type_III 
##                 351                 297                 324

All continuous variables have undergone log (or log1p) transformation to improve relationship patterns and outlier handling using the IQR-based capping method, resulting in a more stable dataset that is ready for Ordinal Logistic Regression modeling.


Prepared for the Generalized Linear Models practicum – Data Science, UNESA