This dataset is used to analyze obesity levels
(NObeyesdad), which consist of seven ordinal categories:
Insufficient Weight < Normal Weight < Overweight Level I <
Overweight Level II < Obesity Type I < Obesity Type II <
Obesity Type III
Since the response variable has an inherent order (ordinal), the method used is Ordinal Logistic Regression, also known as the Proportional Odds Model.
In practice, this model requires several assumptions to be checked, including:
It is important to note that the linearity and outlier assumptions apply only to numerical variables. Therefore, this dataset includes:
pkgs <- c("MASS", "car", "ggplot2", "tidyverse", "corrplot", "gridExtra", "nnet")
for (p in pkgs) if (!require(p, character.only = TRUE)) install.packages(p)
library(MASS)
library(car)
library(ggplot2)
library(tidyverse)
library(corrplot)
library(gridExtra)
library(nnet)
fix_numeric <- function(x) {
sapply(x, function(v) {
v <- trimws(v)
n_dots <- nchar(v) - nchar(gsub("\\.", "", v))
if (n_dots > 1) {
parts <- strsplit(v, "\\.")[[1]]
v <- paste0(parts[1], ".", paste(parts[-1], collapse = ""))
}
suppressWarnings(as.numeric(v))
})
}
df_raw <- read.csv("ObesityDataSet.csv",
sep = ";", stringsAsFactors = FALSE)
cont_vars <- c("Age", "Height", "Weight", "FCVC", "NCP", "CH2O", "FAF", "TUE")
cat_vars <- c("Gender", "family_history_with_overweight", "FAVC",
"CAEC", "SMOKE", "SCC", "CALC", "MTRANS")
df <- df_raw
for (v in cont_vars) df[[v]] <- fix_numeric(df_raw[[v]])
for (v in cat_vars) df[[v]] <- as.factor(df[[v]])
df$NObeyesdad <- factor(df$NObeyesdad,
levels = c("Insufficient_Weight", "Normal_Weight",
"Overweight_Level_I", "Overweight_Level_II",
"Obesity_Type_I", "Obesity_Type_II", "Obesity_Type_III"),
ordered = TRUE)
cat("Data dimensions:", nrow(df), "rows x", ncol(df), "columns\n")## Data dimensions: 2111 rows x 17 columns
##
## Target class distribution:
##
## Insufficient_Weight Normal_Weight Overweight_Level_I Overweight_Level_II
## 272 287 290 290
## Obesity_Type_I Obesity_Type_II Obesity_Type_III
## 351 297 324
##
## Missing values per continuous variable:
## Age Height Weight FCVC NCP CH2O FAF TUE
## 0 0 0 0 0 0 0 0
## Rows: 2,111
## Columns: 17
## $ Gender <fct> Female, Female, Male, Male, Male, Male,…
## $ Age <dbl> 21, 21, 23, 27, 22, 29, 23, 22, 24, 22,…
## $ Height <dbl> 1.62, 1.52, 1.80, 1.80, 1.78, 1.62, 1.5…
## $ Weight <dbl> 64.0, 56.0, 77.0, 87.0, 89.8, 53.0, 55.…
## $ family_history_with_overweight <fct> yes, yes, yes, no, no, no, yes, no, yes…
## $ FAVC <fct> no, no, no, no, no, yes, yes, no, yes, …
## $ FCVC <dbl> 2, 3, 2, 3, 2, 2, 3, 2, 3, 2, 3, 2, 3, …
## $ NCP <dbl> 3, 3, 3, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3, …
## $ CAEC <fct> Sometimes, Sometimes, Sometimes, Someti…
## $ SMOKE <fct> no, yes, no, no, no, no, no, no, no, no…
## $ CH2O <dbl> 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 3, …
## $ SCC <fct> no, yes, no, no, no, no, no, no, no, no…
## $ FAF <dbl> 0, 3, 2, 2, 0, 0, 1, 3, 1, 1, 2, 2, 2, …
## $ TUE <dbl> 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 2, 1, 0, …
## $ CALC <fct> no, Sometimes, Frequently, Frequently, …
## $ MTRANS <fct> Public_Transportation, Public_Transport…
## $ NObeyesdad <ord> Normal_Weight, Normal_Weight, Normal_We…
The linearity assumption requires an approximately linear relationship between the log-odds and the continuous predictor variables. To assess this, the Box-Tidwell test is used by adding an interaction term of the form X * ln(X). If this interaction term is not significant (p > 0.05), the relationship can be considered linear.
For variables such as Age, Weight, and Height, the relationship with log-odds is often not directly linear. Therefore, logarithmic transformation is applied to help stabilize variance and reduce the influence of extreme values.
The transformations used are:
This approach is expected to make the relationship between continuous variables and log-odds closer to linear.
vars_log1p <- c("FAF", "TUE")
vars_log <- setdiff(cont_vars, vars_log1p)
df_t <- df
# Apply transformation
for (v in vars_log) {
df_t[[v]] <- log(df_t[[v]])
}
for (v in vars_log1p) {
df_t[[v]] <- log1p(df_t[[v]])
}
# Summary of transformation results
cat("Transformation completed\n")## Transformation completed
## Variables log(x) : Age, Height, Weight, FCVC, NCP, CH2O
## Variables log(x+1) : FAF, TUE
# Check for invalid values
check <- sapply(cont_vars, function(v) sum(!is.finite(df_t[[v]])))
cat("Number of invalid values (should be 0):\n")## Number of invalid values (should be 0):
## Age Height Weight FCVC NCP CH2O FAF TUE
## 0 0 0 0 0 0 0 0
df_bt <- df_t[complete.cases(df_t[, cont_vars]), ]
# Create interaction terms X * ln(X)
for (v in cont_vars) {
x <- df_bt[[v]]
# Avoid values <= 0 to ensure log is defined
if (any(x <= 0, na.rm = TRUE)) {
x <- x - min(x, na.rm = TRUE) + 0.001
}
df_bt[[paste0(v, "_ln")]] <- x * log(x)
}
# Convert target to unordered (requirement for Box-Tidwell)
df_bt$target_unord <- relevel(
factor(as.character(df_bt$NObeyesdad)),
ref = "Normal_Weight"
)
# Define model formula
bt_formula <- as.formula(paste(
"target_unord ~",
paste(cont_vars, collapse = " + "), "+",
paste(paste0(cont_vars, "_ln"), collapse = " + ")
))
set.seed(42)
model_bt <- multinom(bt_formula, data = df_bt, trace = FALSE, maxit = 500)
# Compute p-values
z_bt <- summary(model_bt)$coefficients / summary(model_bt)$standard.errors
p_bt <- 2 * (1 - pnorm(abs(z_bt)))
ln_terms <- paste0(cont_vars, "_ln")
p_ln <- apply(p_bt[, ln_terms, drop = FALSE], 2, min)
# Summary of results
bt_result <- data.frame(
Variable = cont_vars,
Transformation = c("log(Age)", "log(Height)", "log(Weight)",
"log(FCVC)", "log(NCP)", "log(CH2O)",
"log1p(FAF)", "log1p(TUE)"),
p_value = round(p_ln, 4),
Interpretation = ifelse(p_ln > 0.05, "Satisfied", "Not satisfied")
)
rownames(bt_result) <- NULL
knitr::kable(bt_result, caption = "Box-Tidwell Test Results")| Variable | Transformation | p_value | Interpretation |
|---|---|---|---|
| Age | log(Age) | 0.1531 | Satisfied |
| Height | log(Height) | 0.0008 | Not satisfied |
| Weight | log(Weight) | 0.0000 | Not satisfied |
| FCVC | log(FCVC) | 0.0000 | Not satisfied |
| NCP | log(NCP) | 0.0000 | Not satisfied |
| CH2O | log(CH2O) | 0.0389 | Not satisfied |
| FAF | log1p(FAF) | 0.0000 | Not satisfied |
| TUE | log1p(TUE) | 0.0000 | Not satisfied |
Interpretation: The Box-Tidwell test results indicate that not all variables satisfy the linearity assumption (p > 0.05). This suggests that the relationship between some continuous variables and the log-odds is not fully linear, even after applying log transformation. This condition is common in real-world data. Therefore, the model can still be used.
n_obs <- nrow(df_t)
n_dup <- sum(duplicated(df_t[, c(cont_vars, cat_vars)]))
cat("Number of observations :", n_obs, "\n")## Number of observations : 2111
## Number of duplicates : 24
# Removing duplicates
df_t_clean <- df_t[!duplicated(df_t[, c(cont_vars, cat_vars)]), ]
cat("Number of observations After removing duplicates :", nrow(df_t_clean), "\n\n")## Number of observations After removing duplicates : 2087
Interpretation: The dataset is cross-sectional, where each observation represents a different individual. A small number of duplicate records were identified and subsequently removed. After cleaning, no repeated observations remain, so the independence assumption can be considered satisfied.
Multicollinearity occurs when some predictor variables are highly correlated with each other, which can lead to instability in the estimation of model coefficients.
To detect this issue, two approaches are used:
As a rule of thumb, a VIF value above 10 indicates a serious problem, while values below 5 are generally considered acceptable.
cor_mat <- cor(df_t[, cont_vars], use = "complete.obs")
corrplot(cor_mat,
method = "color",
type = "upper",
addCoef.col = "black",
number.cex = 0.7,
tl.cex = 0.8,
col = colorRampPalette(c("#2166AC", "white", "#D6604D"))(200),
title = "Correlation Matrix (After Log Transformation)",
mar = c(0, 0, 2, 0))df_vif <- df_t
df_vif$NObeyesdad_num <- as.numeric(df_t$NObeyesdad)
vif_formula <- as.formula(paste(
"NObeyesdad_num ~",
paste(cont_vars, collapse = " + "),
"+ Gender + family_history_with_overweight + FAVC +",
"CAEC + SMOKE + SCC + CALC + MTRANS"
))
lm_vif <- lm(vif_formula, data = df_vif)
vif_val <- vif(lm_vif)
vif_df <- data.frame(
Variable = names(vif_val[, 1]),
VIF = round(vif_val[, 1], 3),
Interpretation = ifelse(vif_val[, 1] < 5, "Safe",
ifelse(vif_val[, 1] < 10, "Moderate concern", "High"))
)
rownames(vif_df) <- NULL
knitr::kable(vif_df, caption = "VIF Values of Variables")| Variable | VIF | Interpretation |
|---|---|---|
| Age | 1.061 | Safe |
| Height | 1.020 | Safe |
| Weight | 1.075 | Safe |
| FCVC | 1.020 | Safe |
| NCP | 1.030 | Safe |
| CH2O | 1.026 | Safe |
| FAF | 1.030 | Safe |
| TUE | 1.044 | Safe |
| Gender | 1.098 | Safe |
| family_history_with_overweight | 1.249 | Safe |
| FAVC | 1.163 | Safe |
| CAEC | 1.311 | Safe |
| SMOKE | 1.030 | Safe |
| SCC | 1.090 | Safe |
| CALC | 1.142 | Safe |
| MTRANS | 1.227 | Safe |
Interpretation: Based on the VIF results, all variables have values below 5. This indicates that there is no high correlation among predictors that could affect the stability of the model. Therefore, the assumption of no multicollinearity can be considered satisfied.
Logistic regression models are quite sensitive to the presence of outliers, especially in continuous variables. Therefore, it is important to handle extreme values to prevent distortion in the model results.
The approaches used in this analysis are:
detect_outliers <- function(x, var_name) {
Q1 <- quantile(x, 0.25, na.rm = TRUE)
Q3 <- quantile(x, 0.75, na.rm = TRUE)
IQR_val <- Q3 - Q1
lower <- Q1 - 1.5 * IQR_val
upper <- Q3 + 1.5 * IQR_val
n_out <- sum(x < lower | x > upper, na.rm = TRUE)
data.frame(
Variable = var_name,
Lower = round(lower, 3),
Upper = round(upper, 3),
Outliers = n_out,
Percent = round(n_out / length(x) * 100, 2)
)
}
outlier_tbl <- do.call(rbind,
lapply(cont_vars, function(v) detect_outliers(df_t[[v]], v)))
rownames(outlier_tbl) <- NULL
knitr::kable(outlier_tbl, caption = "Outliers After Log Transformation")| Variable | Lower | Upper | Outliers | Percent |
|---|---|---|---|---|
| Age | 2.490 | 3.719 | 155 | 7.34 |
| Height | 0.356 | 0.715 | 157 | 7.44 |
| Weight | 3.254 | 5.494 | 138 | 6.54 |
| FCVC | 0.085 | 1.707 | 139 | 6.58 |
| NCP | 0.894 | 1.221 | 670 | 31.74 |
| CH2O | -0.216 | 1.670 | 109 | 5.16 |
| FAF | -1.251 | 2.398 | 76 | 3.60 |
| TUE | -1.040 | 1.733 | 35 | 1.66 |
Interpretation: After the log transformation, the number of outliers in each variable has decreased compared to the original data. However, there are still some extreme values that need to be handled further.
plots_before <- lapply(cont_vars, function(v) {
ggplot(df_t, aes_string(y = v)) +
geom_boxplot(fill = "#4292C6",
outlier.color = "red",
outlier.size = 1.2) +
labs(title = v, x = NULL, y = NULL) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, size = 10))
})
grid.arrange(grobs = plots_before, ncol = 4,
top = "Boxplot Before Capping (After Log Transformation)")cap_outlier <- function(x) {
Q1 <- quantile(x, 0.25, na.rm = TRUE)
Q3 <- quantile(x, 0.75, na.rm = TRUE)
IQR_val <- Q3 - Q1
lower <- Q1 - 1.5 * IQR_val
upper <- Q3 + 1.5 * IQR_val
x[x < lower] <- lower
x[x > upper] <- upper
x
}
df_clean <- df_t_clean
for (v in cont_vars) {
df_clean[[v]] <- cap_outlier(df_clean[[v]])
}
## Comparison of outliers before and after capping
for (v in cont_vars) {
Q1 <- quantile(df_t[[v]], 0.25)
Q3 <- quantile(df_t[[v]], 0.75)
n_before <- sum(df_t[[v]] < Q1 - 1.5*IQR(df_t[[v]]) |
df_t[[v]] > Q3 + 1.5*IQR(df_t[[v]]))
Q1c <- quantile(df_clean[[v]], 0.25)
Q3c <- quantile(df_clean[[v]], 0.75)
n_after <- sum(df_clean[[v]] < Q1c - 1.5*IQR(df_clean[[v]]) |
df_clean[[v]] > Q3c + 1.5*IQR(df_clean[[v]]))
cat(sprintf("%-10s : %3d → %d\n", v, n_before, n_after))
}## Age : 155 → 0
## Height : 157 → 0
## Weight : 138 → 0
## FCVC : 139 → 0
## NCP : 670 → 0
## CH2O : 109 → 0
## FAF : 76 → 0
## TUE : 35 → 0
plots_after <- lapply(cont_vars, function(v) {
ggplot(df_clean, aes_string(y = v)) +
geom_boxplot(fill = "#74C476",
outlier.color = "red",
outlier.size = 1.2) +
labs(title = v, x = NULL, y = NULL) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5, size = 10))
})
grid.arrange(grobs = plots_after, ncol = 4,
top = "Boxplot After Capping")
Interpretation: After applying capping, extreme values
have been successfully limited within the IQR range. This reduces the
influence of outliers and makes the data distribution more stable.
Therefore, the impact of outliers has been effectively handled.
summary_assumptions <- data.frame(
No = 1:4,
Assumption = c("Linearity", "Independence", "No Multicollinearity", "No Outliers"),
Method = c("Box-Tidwell",
"Data structure check",
"Correlation & VIF",
"IQR + Winsorizing"),
Treatment = c("Log/log1p transformation applied to approximate linearity",
"No special treatment",
"No special treatment",
"Log transformation followed by capping"),
Status = c("Partially satisfied",
"Satisfied",
"Satisfied",
"Satisfied after treatment")
)
knitr::kable(summary_assumptions,
caption = "Summary of Assumption Testing for Ordinal Logistic Regression Model")| No | Assumption | Method | Treatment | Status |
|---|---|---|---|---|
| 1 | Linearity | Box-Tidwell | Log/log1p transformation applied to approximate linearity | Partially satisfied |
| 2 | Independence | Data structure check | No special treatment | Satisfied |
| 3 | No Multicollinearity | Correlation & VIF | No special treatment | Satisfied |
| 4 | No Outliers | IQR + Winsorizing | Log transformation followed by capping | Satisfied after treatment |
cat(
"The preprocessed dataset contains", nrow(df_clean), "observations and",
ncol(df_clean), "variables.\n\n",
"The distribution of obesity level categories shows that all classes remain well represented:\n"
)## The preprocessed dataset contains 2087 observations and 17 variables.
##
## The distribution of obesity level categories shows that all classes remain well represented:
##
## Insufficient_Weight Normal_Weight Overweight_Level_I Overweight_Level_II
## 267 282 276 290
## Obesity_Type_I Obesity_Type_II Obesity_Type_III
## 351 297 324
All continuous variables have undergone log (or log1p) transformation to improve relationship patterns and outlier handling using the IQR-based capping method, resulting in a more stable dataset that is ready for Ordinal Logistic Regression modeling.
Prepared for the Generalized Linear Models practicum – Data Science, UNESA