Obesity is a global public health issue with serious consequences on physical and mental health. Numerous studies suggest that obesity is influenced by behavioral, dietary, and genetic factors. Early prediction of obesity based on lifestyle variables can help in creating targeted interventions. This study aims to explore obesity prediction using various classification techniques, utilizing a dataset containing demographic and behavioral health data.
The dataset consists of individual-level data such as gender, age,
height, weight, family history of obesity, eating habits, physical
activity, and transportation methods. The target variable is
NObeyesdad, representing obesity status (e.g.,
Normal_Weight, Overweight_Level_I, etc.).
# Define the URL
url <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vQv7ETe9H0ySeTirE9z67a1X2nGMozFPqYvNxVwXc6-tx3IXX-Ez0LGppCzoFvTLz6b7NQg_vGA1PLA/pub?output=csv"
# Read the CSV file
obesity_data <- read.csv(url, stringsAsFactors = FALSE)
# View the first few rows
head(obesity_data)
## Gender Age Height Weight family_history_with_overweight FAVC FCVC NCP
## 1 Female 21 1.62 64.0 yes no 2 3
## 2 Female 21 1.52 56.0 yes no 3 3
## 3 Male 23 1.80 77.0 yes no 2 3
## 4 Male 27 1.80 87.0 no no 3 3
## 5 Male 22 1.78 89.8 no no 2 1
## 6 Male 29 1.62 53.0 no yes 2 3
## CAEC SMOKE CH2O SCC FAF TUE CALC MTRANS
## 1 Sometimes no 2 no 0 1 no Public_Transportation
## 2 Sometimes yes 3 yes 3 0 Sometimes Public_Transportation
## 3 Sometimes no 2 no 2 1 Frequently Public_Transportation
## 4 Sometimes no 2 no 2 0 Frequently Walking
## 5 Sometimes no 2 no 0 0 Sometimes Public_Transportation
## 6 Sometimes no 2 no 0 0 Sometimes Automobile
## NObeyesdad
## 1 Normal_Weight
## 2 Normal_Weight
## 3 Normal_Weight
## 4 Overweight_Level_I
## 5 Overweight_Level_II
## 6 Normal_Weight
# Check structure and summary
str(obesity_data)
## 'data.frame': 2111 obs. of 17 variables:
## $ Gender : chr "Female" "Female" "Male" "Male" ...
## $ Age : num 21 21 23 27 22 29 23 22 24 22 ...
## $ Height : num 1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 ...
## $ Weight : num 64 56 77 87 89.8 53 55 53 64 68 ...
## $ family_history_with_overweight: chr "yes" "yes" "yes" "no" ...
## $ FAVC : chr "no" "no" "no" "no" ...
## $ FCVC : num 2 3 2 3 2 2 3 2 3 2 ...
## $ NCP : num 3 3 3 3 1 3 3 3 3 3 ...
## $ CAEC : chr "Sometimes" "Sometimes" "Sometimes" "Sometimes" ...
## $ SMOKE : chr "no" "yes" "no" "no" ...
## $ CH2O : num 2 3 2 2 2 2 2 2 2 2 ...
## $ SCC : chr "no" "yes" "no" "no" ...
## $ FAF : num 0 3 2 2 0 0 1 3 1 1 ...
## $ TUE : num 1 0 1 0 0 0 0 0 1 1 ...
## $ CALC : chr "no" "Sometimes" "Frequently" "Frequently" ...
## $ MTRANS : chr "Public_Transportation" "Public_Transportation" "Public_Transportation" "Walking" ...
## $ NObeyesdad : chr "Normal_Weight" "Normal_Weight" "Normal_Weight" "Overweight_Level_I" ...
summary(obesity_data)
## Gender Age Height Weight
## Length:2111 Min. :14.00 Min. :1.450 Min. : 39.00
## Class :character 1st Qu.:19.95 1st Qu.:1.630 1st Qu.: 65.47
## Mode :character Median :22.78 Median :1.700 Median : 83.00
## Mean :24.31 Mean :1.702 Mean : 86.59
## 3rd Qu.:26.00 3rd Qu.:1.768 3rd Qu.:107.43
## Max. :61.00 Max. :1.980 Max. :173.00
## family_history_with_overweight FAVC FCVC
## Length:2111 Length:2111 Min. :1.000
## Class :character Class :character 1st Qu.:2.000
## Mode :character Mode :character Median :2.386
## Mean :2.419
## 3rd Qu.:3.000
## Max. :3.000
## NCP CAEC SMOKE CH2O
## Min. :1.000 Length:2111 Length:2111 Min. :1.000
## 1st Qu.:2.659 Class :character Class :character 1st Qu.:1.585
## Median :3.000 Mode :character Mode :character Median :2.000
## Mean :2.686 Mean :2.008
## 3rd Qu.:3.000 3rd Qu.:2.477
## Max. :4.000 Max. :3.000
## SCC FAF TUE CALC
## Length:2111 Min. :0.0000 Min. :0.0000 Length:2111
## Class :character 1st Qu.:0.1245 1st Qu.:0.0000 Class :character
## Mode :character Median :1.0000 Median :0.6253 Mode :character
## Mean :1.0103 Mean :0.6579
## 3rd Qu.:1.6667 3rd Qu.:1.0000
## Max. :3.0000 Max. :2.0000
## MTRANS NObeyesdad
## Length:2111 Length:2111
## Class :character Class :character
## Mode :character Mode :character
##
##
##
Explanation:
These commands provide a comprehensive overview of the dataset.
str() reveals data types and dimensions, confirming the
dataset contains 2,111 observations across 17 variables.
summary() offers descriptive statistics for numeric
variables and frequency distributions for categorical ones—crucial for
understanding variable ranges and potential anomalies.
Variable Summary and Visualization
obesity_data$BMI <- obesity_data$Weight / (obesity_data$Height^2)
# Overview of BMI by obesity class
ggplot(obesity_data, aes(x = NObeyesdad, y = BMI, fill = Gender)) +
geom_boxplot() +
labs(title = "BMI Distribution by Obesity Class", y = "BMI", x = "Obesity Class") +
theme_minimal()
# Check for missing values
colSums(is.na(obesity_data))
## Gender Age
## 0 0
## Height Weight
## 0 0
## family_history_with_overweight FAVC
## 0 0
## FCVC NCP
## 0 0
## CAEC SMOKE
## 0 0
## CH2O SCC
## 0 0
## FAF TUE
## 0 0
## CALC MTRANS
## 0 0
## NObeyesdad BMI
## 0 0
Feature Engineering: BMI Calculation
# BMI calculation
obesity_data$BMI <- obesity_data$Weight / (obesity_data$Height ^ 2)
Explanation:
BMI is calculated using the standard formula: weight in kilograms divided by height in meters squared. BMI is a crucial derived metric in obesity research as it acts as a continuous proxy for weight classification and helps validate categorical obesity labels in the dataset.
Encoding Categorical Variables
# Encode categorical variables as factors
cat_cols <- c("Gender", "family_history_with_overweight", "FAVC", "CAEC",
"SMOKE", "SCC", "CALC", "MTRANS", "NObeyesdad")
obesity_data[cat_cols] <- lapply(obesity_data[cat_cols], factor)
Explanation:
To prepare the data for modeling, categorical features are explicitly converted to factor type. This ensures that modeling functions (e.g., logistic regression, random forest) treat them as discrete entities rather than continuous variables.
One-Hot Encoding
# One-hot encoding for modeling
obesity_data_encoded <- dummyVars(" ~ .", data = obesity_data)
obesity_data_processed <- data.frame(predict(obesity_data_encoded, newdata = obesity_data))
Explanation:
One-hot encoding is performed using caret::dummyVars() to
transform categorical factors into binary indicator variables. This
encoding is necessary for algorithms like kNN, logistic regression, and
XGBoost which require numerical input only.
Outlier Detection with Boxplots
# Outlier detection via boxplots
numeric_cols <- c("Age", "Height", "Weight", "FCVC", "NCP", "CH2O", "FAF", "TUE", "BMI")
par(mfrow = c(3, 3))
for (col in numeric_cols) {
boxplot(obesity_data[[col]], main = col)
}
Explanation:
Boxplots are used to visualize the distribution and detect potential outliers within numeric variables. Identifying these is crucial as extreme values can distort model training, particularly in algorithms sensitive to distance metrics (e.g., kNN).
Standardization of Features
# Standardize numerical variables
obesity_data_scaled <- obesity_data_processed
obesity_data_scaled[numeric_cols] <- scale(obesity_data_scaled[numeric_cols])
Explanation:
Standardization (mean = 0, SD = 1) of numeric variables is essential to ensure features are on a comparable scale. This is especially important for distance-based methods like kNN and models like SVM and regularized regression.
Data Partitioning
# Split dataset
set.seed(123)
splitIndex <- createDataPartition(obesity_data$NObeyesdad, p = 0.8, list = FALSE)
train_data <- obesity_data_scaled[splitIndex, ]
test_data <- obesity_data_scaled[-splitIndex, ]
train_label <- obesity_data$NObeyesdad[splitIndex]
test_label <- obesity_data$NObeyesdad[-splitIndex]
Explanation:
The dataset is split into training (80%) and testing (20%) sets while
preserving class proportions (createDataPartition). This
stratified sampling is crucial for avoiding bias in model evaluation,
especially with multiclass targets.
# Summary statistics
summary(obesity_data[numeric_cols])
## Age Height Weight FCVC
## Min. :14.00 Min. :1.450 Min. : 39.00 Min. :1.000
## 1st Qu.:19.95 1st Qu.:1.630 1st Qu.: 65.47 1st Qu.:2.000
## Median :22.78 Median :1.700 Median : 83.00 Median :2.386
## Mean :24.31 Mean :1.702 Mean : 86.59 Mean :2.419
## 3rd Qu.:26.00 3rd Qu.:1.768 3rd Qu.:107.43 3rd Qu.:3.000
## Max. :61.00 Max. :1.980 Max. :173.00 Max. :3.000
## NCP CH2O FAF TUE
## Min. :1.000 Min. :1.000 Min. :0.0000 Min. :0.0000
## 1st Qu.:2.659 1st Qu.:1.585 1st Qu.:0.1245 1st Qu.:0.0000
## Median :3.000 Median :2.000 Median :1.0000 Median :0.6253
## Mean :2.686 Mean :2.008 Mean :1.0103 Mean :0.6579
## 3rd Qu.:3.000 3rd Qu.:2.477 3rd Qu.:1.6667 3rd Qu.:1.0000
## Max. :4.000 Max. :3.000 Max. :3.0000 Max. :2.0000
## BMI
## Min. :13.00
## 1st Qu.:24.33
## Median :28.72
## Mean :29.70
## 3rd Qu.:36.02
## Max. :50.81
# Correlation heatmap
cor_mat <- cor(obesity_data[numeric_cols])
corrplot(cor_mat, method = "color", tl.cex = 0.7)
Numerical Summary & Correlation Heatmap
The summary() and cor() functions assess
central tendency and relationships among numerical predictors. A heatmap
visually illustrates pairwise correlations, aiding feature selection and
multicollinearity checks.
# Boxplots by obesity level
for (col in numeric_cols) {
p <- ggplot(obesity_data, aes_string(x = "NObeyesdad", y = col)) +
geom_boxplot(fill = "steelblue") +
theme_minimal() +
ggtitle(paste("Boxplot of", col, "by Obesity Level")) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(p) # Correct way to print inside a loop
}
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Boxplots by Obesity Class
Multiple ggplot2::geom_boxplot() calls are used to
visualize the variation of each numeric feature across the obesity
classes. This helps identify discriminative features, e.g., BMI or FCVC,
that show strong separation between categories.
# Bar plot of categorical variable
ggplot(obesity_data, aes(x = MTRANS, fill = NObeyesdad)) +
geom_bar(position = "fill") +
theme_minimal() +
ylab("Proportion") +
ggtitle("Transportation Mode vs obesity_data")
ANOVA (for Numeric Features)
# ANOVA: Test continuous variable against obesity_data level
for (col in numeric_cols) {
print(anova(aov(obesity_data[[col]] ~ obesity_data$NObeyesdad)))
}
## Analysis of Variance Table
##
## Response: obesity_data[[col]]
## Df Sum Sq Mean Sq F value Pr(>F)
## obesity_data$NObeyesdad 6 15454 2575.69 77.954 < 2.2e-16 ***
## Residuals 2104 69518 33.04
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Response: obesity_data[[col]]
## Df Sum Sq Mean Sq F value Pr(>F)
## obesity_data$NObeyesdad 6 1.8144 0.302396 38.432 < 2.2e-16 ***
## Residuals 2104 16.5548 0.007868
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Response: obesity_data[[col]]
## Df Sum Sq Mean Sq F value Pr(>F)
## obesity_data$NObeyesdad 6 1228371 204729 1966.5 < 2.2e-16 ***
## Residuals 2104 219041 104
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Response: obesity_data[[col]]
## Df Sum Sq Mean Sq F value Pr(>F)
## obesity_data$NObeyesdad 6 145.92 24.3203 112.32 < 2.2e-16 ***
## Residuals 2104 455.59 0.2165
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Response: obesity_data[[col]]
## Df Sum Sq Mean Sq F value Pr(>F)
## obesity_data$NObeyesdad 6 90.72 15.120 26.812 < 2.2e-16 ***
## Residuals 2104 1186.55 0.564
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Response: obesity_data[[col]]
## Df Sum Sq Mean Sq F value Pr(>F)
## obesity_data$NObeyesdad 6 34.95 5.8244 16.171 < 2.2e-16 ***
## Residuals 2104 757.81 0.3602
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Response: obesity_data[[col]]
## Df Sum Sq Mean Sq F value Pr(>F)
## obesity_data$NObeyesdad 6 72.5 12.0835 17.484 < 2.2e-16 ***
## Residuals 2104 1454.1 0.6911
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Response: obesity_data[[col]]
## Df Sum Sq Mean Sq F value Pr(>F)
## obesity_data$NObeyesdad 6 17.19 2.86459 7.8767 2.069e-08 ***
## Residuals 2104 765.18 0.36368
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Response: obesity_data[[col]]
## Df Sum Sq Mean Sq F value Pr(>F)
## obesity_data$NObeyesdad 6 130873 21812.1 10085 < 2.2e-16 ***
## Residuals 2104 4550 2.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Explanation:
ANOVA tests whether mean values of numeric features significantly differ across obesity categories. The consistently low p-values (p < 0.001) indicate strong associations between most numeric predictors and obesity level.
Chi-Squared Test (for Categorical Features)
# Chi-squared test for categorical vars
for (cat in cat_cols[-length(cat_cols)]) {
tbl <- table(obesity_data[[cat]], obesity_data$NObeyesdad)
print(chisq.test(tbl))
}
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 657.75, df = 6, p-value < 2.2e-16
##
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 621.98, df = 6, p-value < 2.2e-16
##
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 233.34, df = 6, p-value < 2.2e-16
##
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 802.98, df = 18, p-value < 2.2e-16
##
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 32.138, df = 6, p-value = 1.535e-05
##
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 123.02, df = 6, p-value < 2.2e-16
## Warning in chisq.test(tbl): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 338.58, df = 18, p-value < 2.2e-16
## Warning in chisq.test(tbl): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 292.59, df = 24, p-value < 2.2e-16
Explanation:
The chi-squared test evaluates independence between categorical
predictors and the obesity outcome. Strong statistical significance
suggests features like FAVC,
family_history_with_overweight, and MTRANS are
meaningful predictors.
ctrl <- trainControl(method = "cv", number = 5)
Logistic Regression (Multinomial):
# Logistic Regression (multinomial)
log_model <- multinom(NObeyesdad ~ ., data = cbind(train_data, NObeyesdad=train_label))
## # weights: 287 (240 variable)
## initial value 3290.534062
## iter 10 value 101.471428
## iter 20 value 25.350395
## iter 30 value 0.584021
## iter 40 value 0.056063
## iter 50 value 0.029033
## iter 60 value 0.018486
## iter 70 value 0.012164
## iter 80 value 0.007970
## iter 90 value 0.004416
## iter 100 value 0.002439
## final value 0.002439
## stopped after 100 iterations
log_pred <- predict(log_model, test_data)
confusionMatrix(log_pred, test_label)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Insufficient_Weight Normal_Weight Obesity_Type_I
## Insufficient_Weight 54 0 0
## Normal_Weight 0 57 0
## Obesity_Type_I 0 0 70
## Obesity_Type_II 0 0 0
## Obesity_Type_III 0 0 0
## Overweight_Level_I 0 0 0
## Overweight_Level_II 0 0 0
## Reference
## Prediction Obesity_Type_II Obesity_Type_III Overweight_Level_I
## Insufficient_Weight 0 0 0
## Normal_Weight 0 0 0
## Obesity_Type_I 0 0 0
## Obesity_Type_II 59 0 0
## Obesity_Type_III 0 64 0
## Overweight_Level_I 0 0 58
## Overweight_Level_II 0 0 0
## Reference
## Prediction Overweight_Level_II
## Insufficient_Weight 0
## Normal_Weight 0
## Obesity_Type_I 0
## Obesity_Type_II 0
## Obesity_Type_III 0
## Overweight_Level_I 0
## Overweight_Level_II 58
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9913, 1)
## No Information Rate : 0.1667
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Insufficient_Weight Class: Normal_Weight
## Sensitivity 1.0000 1.0000
## Specificity 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000
## Prevalence 0.1286 0.1357
## Detection Rate 0.1286 0.1357
## Detection Prevalence 0.1286 0.1357
## Balanced Accuracy 1.0000 1.0000
## Class: Obesity_Type_I Class: Obesity_Type_II
## Sensitivity 1.0000 1.0000
## Specificity 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000
## Prevalence 0.1667 0.1405
## Detection Rate 0.1667 0.1405
## Detection Prevalence 0.1667 0.1405
## Balanced Accuracy 1.0000 1.0000
## Class: Obesity_Type_III Class: Overweight_Level_I
## Sensitivity 1.0000 1.0000
## Specificity 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000
## Prevalence 0.1524 0.1381
## Detection Rate 0.1524 0.1381
## Detection Prevalence 0.1524 0.1381
## Balanced Accuracy 1.0000 1.0000
## Class: Overweight_Level_II
## Sensitivity 1.0000
## Specificity 1.0000
## Pos Pred Value 1.0000
## Neg Pred Value 1.0000
## Prevalence 0.1381
## Detection Rate 0.1381
## Detection Prevalence 0.1381
## Balanced Accuracy 1.0000
Explanation:
A baseline linear model for multiclass classification. Achieved perfect prediction on the test set (100% accuracy), which suggests potential data leakage or class separation that is linearly discernible.
Random Forest:
# Random Forest
rf_model <- randomForest(x = train_data, y = train_label)
rf_pred <- predict(rf_model, test_data)
confusionMatrix(rf_pred, test_label)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Insufficient_Weight Normal_Weight Obesity_Type_I
## Insufficient_Weight 54 0 0
## Normal_Weight 0 57 0
## Obesity_Type_I 0 0 70
## Obesity_Type_II 0 0 0
## Obesity_Type_III 0 0 0
## Overweight_Level_I 0 0 0
## Overweight_Level_II 0 0 0
## Reference
## Prediction Obesity_Type_II Obesity_Type_III Overweight_Level_I
## Insufficient_Weight 0 0 0
## Normal_Weight 0 0 0
## Obesity_Type_I 0 0 0
## Obesity_Type_II 59 0 0
## Obesity_Type_III 0 64 0
## Overweight_Level_I 0 0 58
## Overweight_Level_II 0 0 0
## Reference
## Prediction Overweight_Level_II
## Insufficient_Weight 0
## Normal_Weight 0
## Obesity_Type_I 0
## Obesity_Type_II 0
## Obesity_Type_III 0
## Overweight_Level_I 0
## Overweight_Level_II 58
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9913, 1)
## No Information Rate : 0.1667
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Insufficient_Weight Class: Normal_Weight
## Sensitivity 1.0000 1.0000
## Specificity 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000
## Prevalence 0.1286 0.1357
## Detection Rate 0.1286 0.1357
## Detection Prevalence 0.1286 0.1357
## Balanced Accuracy 1.0000 1.0000
## Class: Obesity_Type_I Class: Obesity_Type_II
## Sensitivity 1.0000 1.0000
## Specificity 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000
## Prevalence 0.1667 0.1405
## Detection Rate 0.1667 0.1405
## Detection Prevalence 0.1667 0.1405
## Balanced Accuracy 1.0000 1.0000
## Class: Obesity_Type_III Class: Overweight_Level_I
## Sensitivity 1.0000 1.0000
## Specificity 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000
## Prevalence 0.1524 0.1381
## Detection Rate 0.1524 0.1381
## Detection Prevalence 0.1524 0.1381
## Balanced Accuracy 1.0000 1.0000
## Class: Overweight_Level_II
## Sensitivity 1.0000
## Specificity 1.0000
## Pos Pred Value 1.0000
## Neg Pred Value 1.0000
## Prevalence 0.1381
## Detection Rate 0.1381
## Detection Prevalence 0.1381
## Balanced Accuracy 1.0000
varImpPlot(rf_model)
Explanation:
An ensemble of decision trees using bootstrap aggregation and random
feature selection. Also achieved 100% test accuracy.
varImpPlot() indicates feature importance via Gini
impurity.
XGBoost:
# XGBoost
xgb_train <- xgb.DMatrix(data = as.matrix(train_data), label = as.integer(train_label)-1)
xgb_test <- xgb.DMatrix(data = as.matrix(test_data))
xgb_model <- xgboost(data = xgb_train, max.depth = 6, nrounds = 100, objective = "multi:softmax",
num_class = length(unique(train_label)), verbose = 0)
xgb_pred <- predict(xgb_model, xgb_test)
confusionMatrix(factor(xgb_pred, labels = levels(train_label)), test_label)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Insufficient_Weight Normal_Weight Obesity_Type_I
## Insufficient_Weight 54 0 0
## Normal_Weight 0 57 0
## Obesity_Type_I 0 0 70
## Obesity_Type_II 0 0 0
## Obesity_Type_III 0 0 0
## Overweight_Level_I 0 0 0
## Overweight_Level_II 0 0 0
## Reference
## Prediction Obesity_Type_II Obesity_Type_III Overweight_Level_I
## Insufficient_Weight 0 0 0
## Normal_Weight 0 0 0
## Obesity_Type_I 0 0 0
## Obesity_Type_II 59 0 0
## Obesity_Type_III 0 64 0
## Overweight_Level_I 0 0 58
## Overweight_Level_II 0 0 0
## Reference
## Prediction Overweight_Level_II
## Insufficient_Weight 0
## Normal_Weight 0
## Obesity_Type_I 0
## Obesity_Type_II 0
## Obesity_Type_III 0
## Overweight_Level_I 0
## Overweight_Level_II 58
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9913, 1)
## No Information Rate : 0.1667
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Insufficient_Weight Class: Normal_Weight
## Sensitivity 1.0000 1.0000
## Specificity 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000
## Prevalence 0.1286 0.1357
## Detection Rate 0.1286 0.1357
## Detection Prevalence 0.1286 0.1357
## Balanced Accuracy 1.0000 1.0000
## Class: Obesity_Type_I Class: Obesity_Type_II
## Sensitivity 1.0000 1.0000
## Specificity 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000
## Prevalence 0.1667 0.1405
## Detection Rate 0.1667 0.1405
## Detection Prevalence 0.1667 0.1405
## Balanced Accuracy 1.0000 1.0000
## Class: Obesity_Type_III Class: Overweight_Level_I
## Sensitivity 1.0000 1.0000
## Specificity 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000
## Prevalence 0.1524 0.1381
## Detection Rate 0.1524 0.1381
## Detection Prevalence 0.1524 0.1381
## Balanced Accuracy 1.0000 1.0000
## Class: Overweight_Level_II
## Sensitivity 1.0000
## Specificity 1.0000
## Pos Pred Value 1.0000
## Neg Pred Value 1.0000
## Prevalence 0.1381
## Detection Rate 0.1381
## Detection Prevalence 0.1381
## Balanced Accuracy 1.0000
Explanation:
A gradient boosting framework optimizing classification using a softmax loss. Like previous models, achieved 100% accuracy—highlighting high class separability in this dataset.
k-Nearest Neighbors (k = 5):
# kNN
knn_pred <- knn(train_data, test_data, cl = train_label, k = 5)
confusionMatrix(knn_pred, test_label)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Insufficient_Weight Normal_Weight Obesity_Type_I
## Insufficient_Weight 54 6 0
## Normal_Weight 0 44 0
## Obesity_Type_I 0 1 68
## Obesity_Type_II 0 0 0
## Obesity_Type_III 0 0 0
## Overweight_Level_I 0 5 2
## Overweight_Level_II 0 1 0
## Reference
## Prediction Obesity_Type_II Obesity_Type_III Overweight_Level_I
## Insufficient_Weight 0 0 0
## Normal_Weight 0 0 1
## Obesity_Type_I 0 0 3
## Obesity_Type_II 59 0 0
## Obesity_Type_III 0 64 0
## Overweight_Level_I 0 0 54
## Overweight_Level_II 0 0 0
## Reference
## Prediction Overweight_Level_II
## Insufficient_Weight 0
## Normal_Weight 0
## Obesity_Type_I 2
## Obesity_Type_II 0
## Obesity_Type_III 0
## Overweight_Level_I 1
## Overweight_Level_II 55
##
## Overall Statistics
##
## Accuracy : 0.9476
## 95% CI : (0.9218, 0.9669)
## No Information Rate : 0.1667
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9388
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Insufficient_Weight Class: Normal_Weight
## Sensitivity 1.0000 0.7719
## Specificity 0.9836 0.9972
## Pos Pred Value 0.9000 0.9778
## Neg Pred Value 1.0000 0.9653
## Prevalence 0.1286 0.1357
## Detection Rate 0.1286 0.1048
## Detection Prevalence 0.1429 0.1071
## Balanced Accuracy 0.9918 0.8846
## Class: Obesity_Type_I Class: Obesity_Type_II
## Sensitivity 0.9714 1.0000
## Specificity 0.9829 1.0000
## Pos Pred Value 0.9189 1.0000
## Neg Pred Value 0.9942 1.0000
## Prevalence 0.1667 0.1405
## Detection Rate 0.1619 0.1405
## Detection Prevalence 0.1762 0.1405
## Balanced Accuracy 0.9771 1.0000
## Class: Obesity_Type_III Class: Overweight_Level_I
## Sensitivity 1.0000 0.9310
## Specificity 1.0000 0.9779
## Pos Pred Value 1.0000 0.8710
## Neg Pred Value 1.0000 0.9888
## Prevalence 0.1524 0.1381
## Detection Rate 0.1524 0.1286
## Detection Prevalence 0.1524 0.1476
## Balanced Accuracy 1.0000 0.9545
## Class: Overweight_Level_II
## Sensitivity 0.9483
## Specificity 0.9972
## Pos Pred Value 0.9821
## Neg Pred Value 0.9918
## Prevalence 0.1381
## Detection Rate 0.1310
## Detection Prevalence 0.1333
## Balanced Accuracy 0.9728
Explanation:
A distance-based classifier which assigns the majority class among the k nearest points. Achieved slightly lower accuracy (~94.5%), indicating sensitivity to overlapping class boundaries and outliers.
# Compare models
results <- data.frame(
Model = c("Logistic Regression", "Random Forest", "XGBoost", "kNN"),
Accuracy = c(
mean(log_pred == test_label),
mean(rf_pred == test_label),
mean(xgb_pred == as.integer(test_label)-1),
mean(knn_pred == test_label)
)
)
print(results)
## Model Accuracy
## 1 Logistic Regression 1.000000
## 2 Random Forest 1.000000
## 3 XGBoost 1.000000
## 4 kNN 0.947619
Explanation:
This final comparison table consolidates model performance, demonstrating that Logistic Regression, Random Forest, and XGBoost performed exceptionally well on this dataset (100% accuracy), while kNN slightly underperformed.
Predictive Power:
BMI, Weight, FCVC, and CH2O emerged as top predictors. Categorical
variables like FAVC and
family_history_with_overweight also showed strong
associations.
Model Evaluation:
Although high accuracy is desirable, consistent perfect classification
across multiple models may suggest overfitting or unusually clean and
well-separated data. Further validation using an external dataset is
recommended.
Limitations:
1. Potential self-reporting bias in survey-based features.
2. Risk of overfitting due to perfect accuracy.
3. No assessment of generalizability on unseen data (e.g., cross-dataset evaluation).
1. Introduce regularization (e.g., LASSO, Ridge) for model interpretability.
2. Explore dimensionality reduction (e.g., PCA).
3. Use SHAP values or LIME for explainable AI insights.