Obesity is a global public health issue with serious consequences for physical and mental health. Numerous studies suggest that obesity is influenced by behavioral, dietary, and genetic factors. Early prediction of obesity from lifestyle variables can help create targeted interventions. This study explores obesity prediction using several classification techniques, utilizing a dataset containing demographic and behavioral health data.
The dataset consists of individual-level data such as gender, age, height, weight, family history of obesity, eating habits, physical activity, and transportation methods. The target variable is NObeyesdad, which represents obesity status (e.g., Normal_Weight, Overweight_Level_I).
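The code below assumes the following packages are loaded; the original setup chunk is not shown, so this is a minimal sketch of the libraries used throughout the analysis.
# Packages assumed to be loaded for the analysis that follows
library(dplyr)        # data manipulation (%>%, select)
library(ggplot2)      # plots
library(reshape2)     # melt() for reshaping coefficients
library(corrplot)     # correlation heatmap
library(caret)        # createDataPartition, trainControl, confusionMatrix
library(nnet)         # multinom() for multinomial logistic regression
library(randomForest) # random forest, varImpPlot, MDSplot
library(xgboost)      # gradient boosting
library(class)        # knn()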
# Define the URL
url <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vQv7ETe9H0ySeTirE9z67a1X2nGMozFPqYvNxVwXc6-tx3IXX-Ez0LGppCzoFvTLz6b7NQg_vGA1PLA/pub?output=csv"
# Read the CSV file
obesity_data <- read.csv(url, stringsAsFactors = FALSE)
# View the first few rows
head(obesity_data)
## Gender Age Height Weight family_history_with_overweight FAVC FCVC NCP
## 1 Female 21 1.62 64.0 yes no 2 3
## 2 Female 21 1.52 56.0 yes no 3 3
## 3 Male 23 1.80 77.0 yes no 2 3
## 4 Male 27 1.80 87.0 no no 3 3
## 5 Male 22 1.78 89.8 no no 2 1
## 6 Male 29 1.62 53.0 no yes 2 3
## CAEC SMOKE CH2O SCC FAF TUE CALC MTRANS
## 1 Sometimes no 2 no 0 1 no Public_Transportation
## 2 Sometimes yes 3 yes 3 0 Sometimes Public_Transportation
## 3 Sometimes no 2 no 2 1 Frequently Public_Transportation
## 4 Sometimes no 2 no 2 0 Frequently Walking
## 5 Sometimes no 2 no 0 0 Sometimes Public_Transportation
## 6 Sometimes no 2 no 0 0 Sometimes Automobile
## NObeyesdad
## 1 Normal_Weight
## 2 Normal_Weight
## 3 Normal_Weight
## 4 Overweight_Level_I
## 5 Overweight_Level_II
## 6 Normal_Weight
# Check structure and summary
str(obesity_data)
## 'data.frame': 2111 obs. of 17 variables:
## $ Gender : chr "Female" "Female" "Male" "Male" ...
## $ Age : num 21 21 23 27 22 29 23 22 24 22 ...
## $ Height : num 1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 ...
## $ Weight : num 64 56 77 87 89.8 53 55 53 64 68 ...
## $ family_history_with_overweight: chr "yes" "yes" "yes" "no" ...
## $ FAVC : chr "no" "no" "no" "no" ...
## $ FCVC : num 2 3 2 3 2 2 3 2 3 2 ...
## $ NCP : num 3 3 3 3 1 3 3 3 3 3 ...
## $ CAEC : chr "Sometimes" "Sometimes" "Sometimes" "Sometimes" ...
## $ SMOKE : chr "no" "yes" "no" "no" ...
## $ CH2O : num 2 3 2 2 2 2 2 2 2 2 ...
## $ SCC : chr "no" "yes" "no" "no" ...
## $ FAF : num 0 3 2 2 0 0 1 3 1 1 ...
## $ TUE : num 1 0 1 0 0 0 0 0 1 1 ...
## $ CALC : chr "no" "Sometimes" "Frequently" "Frequently" ...
## $ MTRANS : chr "Public_Transportation" "Public_Transportation" "Public_Transportation" "Walking" ...
## $ NObeyesdad : chr "Normal_Weight" "Normal_Weight" "Normal_Weight" "Overweight_Level_I" ...
summary(obesity_data)
## Gender Age Height Weight
## Length:2111 Min. :14.00 Min. :1.450 Min. : 39.00
## Class :character 1st Qu.:19.95 1st Qu.:1.630 1st Qu.: 65.47
## Mode :character Median :22.78 Median :1.700 Median : 83.00
## Mean :24.31 Mean :1.702 Mean : 86.59
## 3rd Qu.:26.00 3rd Qu.:1.768 3rd Qu.:107.43
## Max. :61.00 Max. :1.980 Max. :173.00
## family_history_with_overweight FAVC FCVC
## Length:2111 Length:2111 Min. :1.000
## Class :character Class :character 1st Qu.:2.000
## Mode :character Mode :character Median :2.386
## Mean :2.419
## 3rd Qu.:3.000
## Max. :3.000
## NCP CAEC SMOKE CH2O
## Min. :1.000 Length:2111 Length:2111 Min. :1.000
## 1st Qu.:2.659 Class :character Class :character 1st Qu.:1.585
## Median :3.000 Mode :character Mode :character Median :2.000
## Mean :2.686 Mean :2.008
## 3rd Qu.:3.000 3rd Qu.:2.477
## Max. :4.000 Max. :3.000
## SCC FAF TUE CALC
## Length:2111 Min. :0.0000 Min. :0.0000 Length:2111
## Class :character 1st Qu.:0.1245 1st Qu.:0.0000 Class :character
## Mode :character Median :1.0000 Median :0.6253 Mode :character
## Mean :1.0103 Mean :0.6579
## 3rd Qu.:1.6667 3rd Qu.:1.0000
## Max. :3.0000 Max. :2.0000
## MTRANS NObeyesdad
## Length:2111 Length:2111
## Class :character Class :character
## Mode :character Mode :character
##
##
##
Explanation:
These commands provide a comprehensive overview of the dataset. str() reveals data types and dimensions, confirming that the dataset contains 2,111 observations across 17 variables. summary() offers descriptive statistics for the numeric variables; because the categorical variables are still stored as character vectors at this point, summary() reports only their length and type, so their frequency distributions are inspected separately.
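For the character columns, frequency tables provide the distributions that summary() does not show at this stage. For example, class balance of the target can be checked directly (a minimal sketch using base R):
# Class balance of the target variable
table(obesity_data$NObeyesdad)
prop.table(table(obesity_data$NObeyesdad))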
Variable Summary and Visualization
obesity_data$BMI <- obesity_data$Weight / (obesity_data$Height^2)
# Overview of BMI by obesity class
ggplot(obesity_data, aes(x = NObeyesdad, y = BMI, fill = Gender)) +
geom_boxplot() +
labs(title = "BMI Distribution by Obesity Class", y = "BMI", x = "Obesity Class") +
theme_minimal()
# Check for missing values
colSums(is.na(obesity_data))
## Gender Age
## 0 0
## Height Weight
## 0 0
## family_history_with_overweight FAVC
## 0 0
## FCVC NCP
## 0 0
## CAEC SMOKE
## 0 0
## CH2O SCC
## 0 0
## FAF TUE
## 0 0
## CALC MTRANS
## 0 0
## NObeyesdad BMI
## 0 0
Feature Engineering: BMI Calculation
# BMI calculation
obesity_data$BMI <- obesity_data$Weight / (obesity_data$Height ^ 2)
Explanation:
BMI is calculated using the standard formula: weight in kilograms divided by height in meters squared. BMI is a crucial derived metric in obesity research as it acts as a continuous proxy for weight classification and helps validate categorical obesity labels in the dataset.
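To illustrate how BMI can be used to sanity-check the categorical labels, one option (a hedged sketch using the standard WHO cut-offs) is to band BMI and cross-tabulate the bands against NObeyesdad:
# Cross-tabulate WHO BMI bands against the dataset's obesity labels
bmi_band <- cut(obesity_data$BMI,
                breaks = c(-Inf, 18.5, 25, 30, 35, 40, Inf),
                labels = c("Underweight", "Normal", "Overweight",
                           "Obese_I", "Obese_II", "Obese_III"))
table(bmi_band, obesity_data$NObeyesdad)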
Encoding Categorical Variables
# Encode categorical variables as factors
cat_cols <- c("Gender", "family_history_with_overweight", "FAVC", "CAEC",
"SMOKE", "SCC", "CALC", "MTRANS", "NObeyesdad")
obesity_data[cat_cols] <- lapply(obesity_data[cat_cols], factor)
Explanation:
To prepare the data for modeling, categorical features are explicitly converted to factor type. This ensures that modeling functions (e.g., logistic regression, random forest) treat them as discrete entities rather than continuous variables.
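The modeling steps later in this report expand these factors into dummy (one-hot) columns via model.matrix(); a small sketch of what that expansion looks like for a single factor:
# Example: expand the Gender factor into 0/1 dummy columns (no intercept)
head(model.matrix(~ Gender - 1, data = obesity_data))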
Outlier Detection with Boxplots
# Outlier detection via boxplots
numeric_cols <- c("Age", "FCVC", "NCP", "CH2O", "FAF", "TUE")
par(mfrow = c(2, 3))  # six plots in a 2 x 3 grid
for (col in numeric_cols) {
boxplot(obesity_data[[col]], main = col)
}
Explanation:
Boxplots are used to visualize the distribution and detect potential outliers within numeric variables. Identifying these is crucial as extreme values can distort model training, particularly in algorithms sensitive to distance metrics (e.g., kNN).
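Beyond visual inspection, a simple numeric check (a sketch using the usual 1.5 x IQR rule) counts how many observations fall outside the boxplot whiskers for each variable:
# Count values beyond 1.5 * IQR for each numeric column
sapply(obesity_data[numeric_cols], function(x) {
  q <- quantile(x, c(0.25, 0.75))
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)
})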
# Remove BMI, Height, and Weight columns to avoid target leakage:
# the obesity label is essentially determined by height and weight (via BMI)
obesity_data <- obesity_data %>% select(-c(Height, Weight, BMI))
Data Partitioning
# Split dataset
set.seed(78)
splitIndex <- createDataPartition(obesity_data$NObeyesdad, p = 0.7, list = FALSE)
train_data <- obesity_data[splitIndex, ]
test_data <- obesity_data[-splitIndex, ]
train_label <- train_data$NObeyesdad
test_label <- test_data$NObeyesdad
train_data$NObeyesdad <- NULL
test_data$NObeyesdad <- NULL
Explanation:
The dataset is split into training (70%) and testing (30%) sets while preserving class proportions (createDataPartition). This stratified sampling is crucial for avoiding bias in model evaluation, especially with multiclass targets.
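To confirm that the split preserved the class proportions, the target distributions in the two partitions can be compared (a quick sketch):
# Compare class proportions in the training and test partitions
round(prop.table(table(train_label)), 3)
round(prop.table(table(test_label)), 3)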
# Summary statistics
summary(obesity_data[numeric_cols])
## Age FCVC NCP CH2O
## Min. :14.00 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:19.95 1st Qu.:2.000 1st Qu.:2.659 1st Qu.:1.585
## Median :22.78 Median :2.386 Median :3.000 Median :2.000
## Mean :24.31 Mean :2.419 Mean :2.686 Mean :2.008
## 3rd Qu.:26.00 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:2.477
## Max. :61.00 Max. :3.000 Max. :4.000 Max. :3.000
## FAF TUE
## Min. :0.0000 Min. :0.0000
## 1st Qu.:0.1245 1st Qu.:0.0000
## Median :1.0000 Median :0.6253
## Mean :1.0103 Mean :0.6579
## 3rd Qu.:1.6667 3rd Qu.:1.0000
## Max. :3.0000 Max. :2.0000
# Correlation heatmap
cor_mat <- cor(obesity_data[numeric_cols])
corrplot(cor_mat, method = "color", tl.cex = 0.7)
Numerical Summary & Correlation Heatmap
The summary() and cor() functions assess central tendency and relationships among the numerical predictors. A heatmap visually illustrates pairwise correlations, aiding feature selection and multicollinearity checks.
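As a complement to the heatmap, highly correlated pairs can be flagged programmatically; a sketch using caret's findCorrelation() with an assumed cutoff of 0.7:
# Flag numeric predictors whose pairwise correlation exceeds the 0.7 cutoff
high_cor <- caret::findCorrelation(cor_mat, cutoff = 0.7, names = TRUE)
high_cor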
# Boxplots by obesity level
for (col in numeric_cols) {
  p <- ggplot(obesity_data, aes(x = NObeyesdad, y = .data[[col]])) +
    geom_boxplot(fill = "steelblue") +
    theme_minimal() +
    ggtitle(paste("Boxplot of", col, "by Obesity Level")) +
    ylab(col) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
  print(p)  # ggplot objects must be printed explicitly inside a loop
}
Boxplots by Obesity Class
Multiple ggplot2::geom_boxplot() calls are used to visualize how each numeric feature varies across the obesity classes. This helps identify discriminative features, e.g., Age or FCVC, that show clearer separation between categories.
# Bar plot of a categorical variable against obesity level
ggplot(obesity_data, aes(x = MTRANS, fill = NObeyesdad)) +
  geom_bar(position = "fill") +
  theme_minimal() +
  ylab("Proportion") +
  ggtitle("Transportation Mode vs. Obesity Level")
ANOVA (for Numeric Features)
# ANOVA: test each continuous variable against obesity level
for (col in numeric_cols) {
print(anova(aov(obesity_data[[col]] ~ obesity_data$NObeyesdad)))
}
## Analysis of Variance Table
##
## Response: obesity_data[[col]]
## Df Sum Sq Mean Sq F value Pr(>F)
## obesity_data$NObeyesdad 6 15454 2575.69 77.954 < 2.2e-16 ***
## Residuals 2104 69518 33.04
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Response: obesity_data[[col]]
## Df Sum Sq Mean Sq F value Pr(>F)
## obesity_data$NObeyesdad 6 145.92 24.3203 112.32 < 2.2e-16 ***
## Residuals 2104 455.59 0.2165
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Response: obesity_data[[col]]
## Df Sum Sq Mean Sq F value Pr(>F)
## obesity_data$NObeyesdad 6 90.72 15.120 26.812 < 2.2e-16 ***
## Residuals 2104 1186.55 0.564
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Response: obesity_data[[col]]
## Df Sum Sq Mean Sq F value Pr(>F)
## obesity_data$NObeyesdad 6 34.95 5.8244 16.171 < 2.2e-16 ***
## Residuals 2104 757.81 0.3602
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Response: obesity_data[[col]]
## Df Sum Sq Mean Sq F value Pr(>F)
## obesity_data$NObeyesdad 6 72.5 12.0835 17.484 < 2.2e-16 ***
## Residuals 2104 1454.1 0.6911
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
##
## Response: obesity_data[[col]]
## Df Sum Sq Mean Sq F value Pr(>F)
## obesity_data$NObeyesdad 6 17.19 2.86459 7.8767 2.069e-08 ***
## Residuals 2104 765.18 0.36368
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Explanation:
ANOVA tests whether the mean of each numeric feature differs significantly across obesity categories; the tables above are printed in the order of numeric_cols (Age, FCVC, NCP, CH2O, FAF, TUE). The consistently low p-values (p < 0.001) indicate strong associations between all six numeric predictors and obesity level.
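Because the printed tables do not name the variable being tested, a slightly more readable variant (a sketch) collects the labeled p-values in a single table:
# Collect one-way ANOVA p-values per numeric feature, labeled by variable
anova_p <- sapply(numeric_cols, function(col) {
  fit <- aov(obesity_data[[col]] ~ obesity_data$NObeyesdad)
  summary(fit)[[1]][["Pr(>F)"]][1]
})
data.frame(Feature = numeric_cols, p_value = signif(anova_p, 3))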
Chi-Squared Test (for Categorical Features)
# Chi-squared test for categorical vars
for (cat in cat_cols[-length(cat_cols)]) {
tbl <- table(obesity_data[[cat]], obesity_data$NObeyesdad)
print(chisq.test(tbl))
}
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 657.75, df = 6, p-value < 2.2e-16
##
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 621.98, df = 6, p-value < 2.2e-16
##
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 233.34, df = 6, p-value < 2.2e-16
##
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 802.98, df = 18, p-value < 2.2e-16
##
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 32.138, df = 6, p-value = 1.535e-05
##
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 123.02, df = 6, p-value < 2.2e-16
## Warning in chisq.test(tbl): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 338.58, df = 18, p-value < 2.2e-16
## Warning in chisq.test(tbl): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 292.59, df = 24, p-value < 2.2e-16
Explanation:
The chi-squared test evaluates independence between each categorical predictor and the obesity outcome. Strong statistical significance suggests features like FAVC, family_history_with_overweight, and MTRANS are meaningful predictors. The warnings that the chi-squared approximation may be incorrect stem from sparse cells (low expected counts) in some contingency tables.
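P-values alone do not convey the strength of association; a short sketch computes Cramer's V for each categorical predictor, where V = sqrt(X^2 / (n * (min(r, c) - 1))) and larger values indicate stronger association:
# Cramer's V effect size for each categorical predictor vs. the target
cramers_v <- sapply(cat_cols[-length(cat_cols)], function(v) {
  tbl <- table(obesity_data[[v]], obesity_data$NObeyesdad)
  chi2 <- suppressWarnings(chisq.test(tbl))$statistic
  sqrt(as.numeric(chi2) / (sum(tbl) * (min(dim(tbl)) - 1)))
})
round(cramers_v, 3)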
# 5-fold cross-validation control (usable with caret::train; the models below are fitted directly)
ctrl <- trainControl(method = "cv", number = 5)
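For reference, a hedged sketch of how ctrl could drive a cross-validated fit through caret::train (not part of the original analysis, and the refit can take several minutes):
# Example only: 5-fold cross-validated random forest reusing ctrl
rf_cv <- train(x = train_data, y = train_label, method = "rf", trControl = ctrl)
rf_cv$results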
Logistic Regression (Multinomial):
# Logistic Regression (multinomial)
log_model <- multinom(NObeyesdad ~ ., data = cbind(train_data, NObeyesdad=train_label))
## # weights: 161 (132 variable)
## initial value 2878.001110
## iter 10 value 2140.370039
## iter 20 value 1765.883309
## iter 30 value 1609.788843
## iter 40 value 1555.349780
## iter 50 value 1522.454476
## iter 60 value 1511.590537
## iter 70 value 1503.056446
## iter 80 value 1492.994820
## iter 90 value 1486.012993
## iter 100 value 1478.630888
## final value 1478.630888
## stopped after 100 iterations
log_pred <- predict(log_model, test_data)
confusionMatrix(log_pred, test_label)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Insufficient_Weight Normal_Weight Obesity_Type_I
## Insufficient_Weight 62 28 7
## Normal_Weight 8 31 1
## Obesity_Type_I 3 5 73
## Obesity_Type_II 0 4 12
## Obesity_Type_III 0 1 0
## Overweight_Level_I 7 10 5
## Overweight_Level_II 1 7 7
## Reference
## Prediction Obesity_Type_II Obesity_Type_III Overweight_Level_I
## Insufficient_Weight 1 1 4
## Normal_Weight 0 0 3
## Obesity_Type_I 11 0 25
## Obesity_Type_II 75 0 10
## Obesity_Type_III 0 96 1
## Overweight_Level_I 0 0 40
## Overweight_Level_II 2 0 4
## Reference
## Prediction Overweight_Level_II
## Insufficient_Weight 3
## Normal_Weight 6
## Obesity_Type_I 26
## Obesity_Type_II 18
## Obesity_Type_III 1
## Overweight_Level_I 5
## Overweight_Level_II 28
##
## Overall Statistics
##
## Accuracy : 0.6408
## 95% CI : (0.602, 0.6783)
## No Information Rate : 0.1661
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5797
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Insufficient_Weight Class: Normal_Weight
## Sensitivity 0.7654 0.36047
## Specificity 0.9201 0.96703
## Pos Pred Value 0.5849 0.63265
## Neg Pred Value 0.9639 0.90566
## Prevalence 0.1282 0.13608
## Detection Rate 0.0981 0.04905
## Detection Prevalence 0.1677 0.07753
## Balanced Accuracy 0.8428 0.66375
## Class: Obesity_Type_I Class: Obesity_Type_II
## Sensitivity 0.6952 0.8427
## Specificity 0.8672 0.9190
## Pos Pred Value 0.5105 0.6303
## Neg Pred Value 0.9346 0.9727
## Prevalence 0.1661 0.1408
## Detection Rate 0.1155 0.1187
## Detection Prevalence 0.2263 0.1883
## Balanced Accuracy 0.7812 0.8808
## Class: Obesity_Type_III Class: Overweight_Level_I
## Sensitivity 0.9897 0.45977
## Specificity 0.9944 0.95046
## Pos Pred Value 0.9697 0.59701
## Neg Pred Value 0.9981 0.91681
## Prevalence 0.1535 0.13766
## Detection Rate 0.1519 0.06329
## Detection Prevalence 0.1566 0.10601
## Balanced Accuracy 0.9920 0.70511
## Class: Overweight_Level_II
## Sensitivity 0.32184
## Specificity 0.96147
## Pos Pred Value 0.57143
## Neg Pred Value 0.89880
## Prevalence 0.13766
## Detection Rate 0.04430
## Detection Prevalence 0.07753
## Balanced Accuracy 0.64165
Explanation:
A baseline linear model for multiclass classification. It reached about 64% test accuracy (Kappa of about 0.58), performing well on the extreme classes (Insufficient_Weight, Obesity_Type_III) but frequently confusing adjacent categories such as Normal_Weight and the Overweight levels, as expected from linear decision boundaries. Note that the optimizer stopped at the default 100 iterations, so increasing maxit might improve the fit slightly.
# Get coefficients (rows = outcome classes, columns = features) and reshape
coefs <- summary(log_model)$coefficients
coefs_df <- as.data.frame(coefs)
coefs_df$Class <- rownames(coefs_df)
coefs_long <- melt(coefs_df, id.vars = "Class", variable.name = "Feature", value.name = "Coefficient")
# Keep top 5 features per class by absolute coefficient
top_features <- coefs_long %>%
group_by(Class) %>%
top_n(5, wt = abs(Coefficient)) %>%
ungroup()
# Plot
ggplot(top_features, aes(x = reorder(Feature, abs(Coefficient)), y = Coefficient, fill = Class)) +
geom_bar(stat = "identity", position = "dodge") +
coord_flip() +
theme_minimal(base_size = 10) +
labs(title = "Top Influential Logistic Regression Coefficients",
y = "Coefficient", x = "Feature") +
scale_fill_brewer(palette = "Set1")
This bar plot displays, for each outcome class (relative to the multinomial baseline class), its most influential coefficients. Positive coefficients increase the log-odds of that class, while negative coefficients reduce it. The visualization helps interpret which features push predictions toward or away from specific obesity classes.
Confusion Matrix:
log_cm <- confusionMatrix(log_pred, test_label)
cm_df_log <- as.data.frame(log_cm$table)
colnames(cm_df_log) <- c("Predicted", "Actual", "Freq")
# Plot Confusion Matrix Heatmap
ggplot(cm_df_log, aes(x = Actual, y = Predicted, fill = Freq)) +
geom_tile(color = "white") +
geom_text(aes(label = Freq), color = "black", size = 5) +
scale_fill_gradient(low = "lightyellow", high = "darkred") +
labs(title = "Logistic Regression Confusion Matrix",
x = "Actual Class", y = "Predicted Class") +
theme_minimal()
Confusion Matrix: Shows moderate performance, with most predictions aligned along the diagonal. Some confusion between similar obesity levels indicates sensitivity to overlapping class boundaries.
PCA Plot:
# ---- PCA for Test Data ----
# Encode all features numerically using model.matrix
log_train_matrix <- model.matrix(~ . - 1, data = train_data)
log_test_matrix <- model.matrix(~ . - 1, data = test_data)
# Combine for consistent PCA
log_combined <- rbind(log_train_matrix, log_test_matrix)
pca_log <- prcomp(log_combined, scale. = TRUE)
# Extract PCA of test data
n_train <- nrow(log_train_matrix)
pca_test_log <- as.data.frame(pca_log$x[(n_train + 1):nrow(pca_log$x), 1:2])
pca_test_log$TrueClass <- test_label
pca_test_log$PredictedClass <- log_pred
# Plot PCA Projection
ggplot(pca_test_log, aes(x = PC1, y = PC2, color = PredictedClass, shape = TrueClass)) +
geom_point(size = 3, alpha = 0.75) +
labs(title = "Logistic Regression Predictions in PCA Space",
x = "Principal Component 1", y = "Principal Component 2") +
theme_minimal() +
scale_color_brewer(palette = "Set1") +
guides(color = guide_legend(title = "Predicted Class"),
shape = guide_legend(title = "True Class"))
## Warning: The shape palette can deal with a maximum of 6 discrete values because more
## than 6 becomes difficult to discriminate
## ℹ you have requested 7 values. Consider specifying shapes manually if you need
## that many have them.
## Warning: Removed 87 rows containing missing values or values outside the scale range
## (`geom_point()`).
PCA Plot: Moderate class separation. Some predicted classes overlap with true classes, reflecting logistic regression’s linear decision boundaries.
Random Forest:
# Random Forest
rf_model <- randomForest(x = train_data, y = train_label)
rf_pred <- predict(rf_model, test_data)
confusionMatrix(rf_pred, test_label)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Insufficient_Weight Normal_Weight Obesity_Type_I
## Insufficient_Weight 73 5 1
## Normal_Weight 7 66 7
## Obesity_Type_I 1 3 90
## Obesity_Type_II 0 1 2
## Obesity_Type_III 0 1 0
## Overweight_Level_I 0 6 3
## Overweight_Level_II 0 4 2
## Reference
## Prediction Obesity_Type_II Obesity_Type_III Overweight_Level_I
## Insufficient_Weight 0 0 1
## Normal_Weight 1 1 6
## Obesity_Type_I 0 0 11
## Obesity_Type_II 84 0 1
## Obesity_Type_III 0 96 1
## Overweight_Level_I 0 0 64
## Overweight_Level_II 4 0 3
## Reference
## Prediction Overweight_Level_II
## Insufficient_Weight 1
## Normal_Weight 7
## Obesity_Type_I 8
## Obesity_Type_II 4
## Obesity_Type_III 0
## Overweight_Level_I 2
## Overweight_Level_II 65
##
## Overall Statistics
##
## Accuracy : 0.8513
## 95% CI : (0.8211, 0.8781)
## No Information Rate : 0.1661
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8262
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Insufficient_Weight Class: Normal_Weight
## Sensitivity 0.9012 0.7674
## Specificity 0.9855 0.9469
## Pos Pred Value 0.9012 0.6947
## Neg Pred Value 0.9855 0.9628
## Prevalence 0.1282 0.1361
## Detection Rate 0.1155 0.1044
## Detection Prevalence 0.1282 0.1503
## Balanced Accuracy 0.9434 0.8572
## Class: Obesity_Type_I Class: Obesity_Type_II
## Sensitivity 0.8571 0.9438
## Specificity 0.9564 0.9853
## Pos Pred Value 0.7965 0.9130
## Neg Pred Value 0.9711 0.9907
## Prevalence 0.1661 0.1408
## Detection Rate 0.1424 0.1329
## Detection Prevalence 0.1788 0.1456
## Balanced Accuracy 0.9067 0.9645
## Class: Obesity_Type_III Class: Overweight_Level_I
## Sensitivity 0.9897 0.7356
## Specificity 0.9963 0.9798
## Pos Pred Value 0.9796 0.8533
## Neg Pred Value 0.9981 0.9587
## Prevalence 0.1535 0.1377
## Detection Rate 0.1519 0.1013
## Detection Prevalence 0.1551 0.1187
## Balanced Accuracy 0.9930 0.8577
## Class: Overweight_Level_II
## Sensitivity 0.7471
## Specificity 0.9761
## Pos Pred Value 0.8333
## Neg Pred Value 0.9603
## Prevalence 0.1377
## Detection Rate 0.1028
## Detection Prevalence 0.1234
## Balanced Accuracy 0.8616
varImpPlot(rf_model)
Explanation:
An ensemble of decision trees using bootstrap aggregation and random feature selection. It achieved roughly 85% test accuracy (Kappa of about 0.83), a clear improvement over the multinomial logistic baseline. varImpPlot() displays feature importance via mean decrease in Gini impurity.
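For a tabular view of the same information, the importance scores can be extracted and sorted directly (a short sketch):
# Top features by mean decrease in Gini impurity
imp <- importance(rf_model)
head(imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE], 10)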
# Out-of-bag error rate as the number of trees grows
plot(rf_model, main = "Random Forest Error vs. Trees")
# MDS plot (only works if proximity=TRUE in model)
rf_model_mds <- randomForest(x = train_data, y = train_label, proximity = TRUE)
MDSplot(rf_model_mds, train_label)
The MDS plot visualizes how the Random Forest groups similar observations based on proximity (how often they fall in the same leaf across trees). Points closer together are considered more similar. It reveals natural clusters and potential class overlap.
Confusion Matrix:
# Confusion matrix
rf_cm <- confusionMatrix(rf_pred, test_label)
cm_df_rf <- as.data.frame(rf_cm$table)
colnames(cm_df_rf) <- c("Predicted", "Actual", "Freq")
# Plot confusion matrix heatmap
ggplot(cm_df_rf, aes(x = Actual, y = Predicted, fill = Freq)) +
geom_tile(color = "white") +
geom_text(aes(label = Freq), color = "black", size = 5) +
scale_fill_gradient(low = "lightyellow", high = "darkblue") +
labs(title = "Random Forest Confusion Matrix",
x = "Actual Class", y = "Predicted Class") +
theme_minimal()
Confusion Matrix: Strong diagonal presence with fewer off-diagonal errors, indicating high accuracy and robust handling of class variance.
PCA Plot:
# Ensure numeric inputs for PCA
rf_train_matrix <- model.matrix(~ . - 1, data = train_data)
rf_test_matrix <- model.matrix(~ . - 1, data = test_data)
# Combine train and test for consistent PCA
rf_combined <- rbind(rf_train_matrix, rf_test_matrix)
# Perform PCA on numeric matrix
pca_rf <- prcomp(rf_combined, scale. = TRUE)
# Extract test set PCA results
n_train <- nrow(rf_train_matrix)
pca_test_rf <- as.data.frame(pca_rf$x[(n_train + 1):nrow(pca_rf$x), 1:2])
pca_test_rf$TrueClass <- test_label
pca_test_rf$PredictedClass <- rf_pred
# Plot
library(ggplot2)
ggplot(pca_test_rf, aes(x = PC1, y = PC2, color = PredictedClass, shape = TrueClass)) +
geom_point(size = 3, alpha = 0.7) +
labs(title = "Random Forest Predictions in PCA Space",
x = "Principal Component 1", y = "Principal Component 2") +
theme_minimal() +
scale_color_brewer(palette = "Dark2") +
guides(color = guide_legend(title = "Predicted"), shape = guide_legend(title = "True"))
## Warning: The shape palette can deal with a maximum of 6 discrete values because more
## than 6 becomes difficult to discriminate
## ℹ you have requested 7 values. Consider specifying shapes manually if you need
## that many have them.
## Warning: Removed 87 rows containing missing values or values outside the scale range
## (`geom_point()`).
PCA Plot: Good separation between predicted and actual classes. Most points match in color and shape, showing effective classification in reduced space.
XGBoost:
# XGBoost - Encode factors with model.matrix
xgb_train_matrix <- model.matrix(~ . - 1, data = train_data)
xgb_test_matrix <- model.matrix(~ . - 1, data = test_data)
xgb_train <- xgb.DMatrix(data = xgb_train_matrix, label = as.integer(train_label) - 1)
xgb_test <- xgb.DMatrix(data = xgb_test_matrix)
xgb_model <- xgboost(
data = xgb_train,
max.depth = 6,
nrounds = 100,
objective = "multi:softmax",
num_class = length(levels(train_label)),
verbose = 0
)
xgb_pred <- predict(xgb_model, xgb_test)
confusionMatrix(factor(xgb_pred, labels = levels(train_label)), test_label)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Insufficient_Weight Normal_Weight Obesity_Type_I
## Insufficient_Weight 70 8 2
## Normal_Weight 6 61 5
## Obesity_Type_I 1 5 88
## Obesity_Type_II 1 0 2
## Obesity_Type_III 1 0 0
## Overweight_Level_I 2 7 3
## Overweight_Level_II 0 5 5
## Reference
## Prediction Obesity_Type_II Obesity_Type_III Overweight_Level_I
## Insufficient_Weight 0 0 1
## Normal_Weight 1 1 4
## Obesity_Type_I 0 0 14
## Obesity_Type_II 84 0 1
## Obesity_Type_III 0 96 0
## Overweight_Level_I 0 0 61
## Overweight_Level_II 4 0 6
## Reference
## Prediction Overweight_Level_II
## Insufficient_Weight 1
## Normal_Weight 5
## Obesity_Type_I 7
## Obesity_Type_II 2
## Obesity_Type_III 0
## Overweight_Level_I 3
## Overweight_Level_II 69
##
## Overall Statistics
##
## Accuracy : 0.837
## 95% CI : (0.8059, 0.865)
## No Information Rate : 0.1661
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8095
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Insufficient_Weight Class: Normal_Weight
## Sensitivity 0.8642 0.70930
## Specificity 0.9782 0.95971
## Pos Pred Value 0.8537 0.73494
## Neg Pred Value 0.9800 0.95446
## Prevalence 0.1282 0.13608
## Detection Rate 0.1108 0.09652
## Detection Prevalence 0.1297 0.13133
## Balanced Accuracy 0.9212 0.83450
## Class: Obesity_Type_I Class: Obesity_Type_II
## Sensitivity 0.8381 0.9438
## Specificity 0.9488 0.9890
## Pos Pred Value 0.7652 0.9333
## Neg Pred Value 0.9671 0.9908
## Prevalence 0.1661 0.1408
## Detection Rate 0.1392 0.1329
## Detection Prevalence 0.1820 0.1424
## Balanced Accuracy 0.8934 0.9664
## Class: Obesity_Type_III Class: Overweight_Level_I
## Sensitivity 0.9897 0.70115
## Specificity 0.9981 0.97248
## Pos Pred Value 0.9897 0.80263
## Neg Pred Value 0.9981 0.95324
## Prevalence 0.1535 0.13766
## Detection Rate 0.1519 0.09652
## Detection Prevalence 0.1535 0.12025
## Balanced Accuracy 0.9939 0.83681
## Class: Overweight_Level_II
## Sensitivity 0.7931
## Specificity 0.9633
## Pos Pred Value 0.7753
## Neg Pred Value 0.9669
## Prevalence 0.1377
## Detection Rate 0.1092
## Detection Prevalence 0.1408
## Balanced Accuracy 0.8782
Explanation:
A gradient boosting framework optimizing classification with a softmax objective. It reached about 84% test accuracy (Kappa of about 0.81), comparable to the random forest and well above the logistic regression baseline.
# Plot the first tree (index = 0)
xgb.plot.tree(model = xgb_model, trees = 0, show_node_id = TRUE)
Explanation:
This gives a visual layout of one decision tree used in the model. It helps you understand how the model splits on features at each depth.
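A complementary view (a sketch) summarizes feature importance aggregated over all boosting rounds rather than a single tree:
# Gain-based feature importance aggregated over all trees
xgb_imp <- xgb.importance(feature_names = colnames(xgb_train_matrix), model = xgb_model)
head(xgb_imp, 10)
xgb.plot.importance(xgb_imp, top_n = 10)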
Confusion Matrix:
# Create confusion matrix
xgb_cm <- confusionMatrix(factor(xgb_pred, labels = levels(train_label)), test_label)
cm_df <- as.data.frame(xgb_cm$table)
colnames(cm_df) <- c("Predicted", "Actual", "Freq")
# Plot confusion matrix as heatmap
ggplot(cm_df, aes(x = Actual, y = Predicted, fill = Freq)) +
geom_tile(color = "white") +
geom_text(aes(label = Freq), color = "black", size = 5) +
scale_fill_gradient(low = "lightyellow", high = "darkred") +
labs(title = "XGBoost Confusion Matrix",
x = "Actual Class", y = "Predicted Class") +
theme_minimal()
Confusion Matrix: High accuracy with sharper diagonal concentration, especially in difficult-to-separate classes. Few misclassifications.
PCA Plot:
# Combine both train and test for consistent PCA transformation
combined <- rbind(xgb_train_matrix, xgb_test_matrix)
pca <- prcomp(combined, scale. = TRUE)
# Extract PCA components for test data
pca_test <- as.data.frame(pca$x[(nrow(xgb_train_matrix) + 1):nrow(pca$x), 1:2])
pca_test$TrueClass <- test_label
pca_test$PredictedClass <- factor(xgb_pred, labels = levels(train_label))
# Plot PCA projection
ggplot(pca_test, aes(x = PC1, y = PC2, color = PredictedClass, shape = TrueClass)) +
geom_point(size = 3, alpha = 0.7) +
labs(title = "XGBoost Predictions in PCA Space",
x = "Principal Component 1", y = "Principal Component 2") +
theme_minimal() +
scale_color_brewer(palette = "Set1") +
guides(color = guide_legend(title = "Predicted"), shape = guide_legend(title = "True"))
## Warning: The shape palette can deal with a maximum of 6 discrete values because more
## than 6 becomes difficult to discriminate
## ℹ you have requested 7 values. Consider specifying shapes manually if you need
## that many have them.
## Warning: Removed 87 rows containing missing values or values outside the scale range
## (`geom_point()`).
PCA Plot: Clear class clusters with minimal overlap. Predicted labels closely match actual classes, showing strong model generalization.
k-Nearest Neighbors (k = 5):
# kNN - Requires numeric input
knn_train <- model.matrix(~ . - 1, data = train_data)
knn_test <- model.matrix(~ . - 1, data = test_data)
knn_pred <- knn(train = knn_train, test = knn_test, cl = train_label, k = 5)
confusionMatrix(knn_pred, test_label)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Insufficient_Weight Normal_Weight Obesity_Type_I
## Insufficient_Weight 72 19 4
## Normal_Weight 2 21 1
## Obesity_Type_I 2 16 86
## Obesity_Type_II 0 5 4
## Obesity_Type_III 1 8 1
## Overweight_Level_I 2 9 1
## Overweight_Level_II 2 8 8
## Reference
## Prediction Obesity_Type_II Obesity_Type_III Overweight_Level_I
## Insufficient_Weight 0 0 1
## Normal_Weight 1 1 3
## Obesity_Type_I 0 1 10
## Obesity_Type_II 88 0 2
## Obesity_Type_III 0 95 10
## Overweight_Level_I 0 0 58
## Overweight_Level_II 0 0 3
## Reference
## Prediction Overweight_Level_II
## Insufficient_Weight 3
## Normal_Weight 1
## Obesity_Type_I 12
## Obesity_Type_II 9
## Obesity_Type_III 2
## Overweight_Level_I 5
## Overweight_Level_II 55
##
## Overall Statistics
##
## Accuracy : 0.7516
## 95% CI : (0.716, 0.7848)
## No Information Rate : 0.1661
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7094
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Insufficient_Weight Class: Normal_Weight
## Sensitivity 0.8889 0.24419
## Specificity 0.9510 0.98352
## Pos Pred Value 0.7273 0.70000
## Neg Pred Value 0.9831 0.89203
## Prevalence 0.1282 0.13608
## Detection Rate 0.1139 0.03323
## Detection Prevalence 0.1566 0.04747
## Balanced Accuracy 0.9199 0.61385
## Class: Obesity_Type_I Class: Obesity_Type_II
## Sensitivity 0.8190 0.9888
## Specificity 0.9222 0.9632
## Pos Pred Value 0.6772 0.8148
## Neg Pred Value 0.9624 0.9981
## Prevalence 0.1661 0.1408
## Detection Rate 0.1361 0.1392
## Detection Prevalence 0.2009 0.1709
## Balanced Accuracy 0.8706 0.9760
## Class: Obesity_Type_III Class: Overweight_Level_I
## Sensitivity 0.9794 0.66667
## Specificity 0.9589 0.96881
## Pos Pred Value 0.8120 0.77333
## Neg Pred Value 0.9961 0.94794
## Prevalence 0.1535 0.13766
## Detection Rate 0.1503 0.09177
## Detection Prevalence 0.1851 0.11867
## Balanced Accuracy 0.9691 0.81774
## Class: Overweight_Level_II
## Sensitivity 0.63218
## Specificity 0.96147
## Pos Pred Value 0.72368
## Neg Pred Value 0.94245
## Prevalence 0.13766
## Detection Rate 0.08703
## Detection Prevalence 0.12025
## Balanced Accuracy 0.79683
Explanation:
A distance-based classifier that assigns the majority class among the k nearest points. It achieved about 75% test accuracy, below the tree-based ensembles, reflecting its sensitivity to overlapping class boundaries, outliers, and the mixed scales of the one-hot encoded inputs (see the scaling sketch below).
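Because kNN relies on Euclidean distance, features on larger numeric scales (e.g., Age) can dominate the 0/1 dummy columns. A hedged sketch of standardizing the training matrix and applying the same centering and scaling to the test matrix before rerunning kNN:
# Standardize features using training-set statistics only
# (assumes no zero-variance columns in the training matrix)
knn_train_scaled <- scale(knn_train)
knn_test_scaled <- scale(knn_test,
                         center = attr(knn_train_scaled, "scaled:center"),
                         scale = attr(knn_train_scaled, "scaled:scale"))
knn_pred_scaled <- knn(train = knn_train_scaled, test = knn_test_scaled,
                       cl = train_label, k = 5)
mean(knn_pred_scaled == test_label)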
Confusion Matrix:
# Confusion matrix for the kNN predictions computed above
cm <- confusionMatrix(knn_pred, test_label)
# Convert to data frame for ggplot
cm_df <- as.data.frame(cm$table)
colnames(cm_df) <- c("Predicted", "Actual", "Freq")
# Heatmap plot
ggplot(cm_df, aes(x = Actual, y = Predicted, fill = Freq)) +
geom_tile(color = "white") +
geom_text(aes(label = Freq), color = "black", size = 5) +
scale_fill_gradient(low = "lightblue", high = "darkblue") +
labs(title = "k-NN Confusion Matrix (k = 5)", x = "Actual Class", y = "Predicted Class") +
theme_minimal()
Confusion Matrix: Lower accuracy compared to other models. Several off-diagonal errors due to sensitivity to outliers and class overlap.
PCA Plot:
# PCA on combined data to get consistent transformation
combined <- rbind(knn_train, knn_test)
pca <- prcomp(combined, scale. = TRUE)
# Extract PCA scores for test set
pca_test <- as.data.frame(pca$x[(nrow(knn_train)+1):nrow(pca$x), 1:2])
pca_test$TrueClass <- test_label
pca_test$PredictedClass <- knn_pred
# Plot PCA results
ggplot(pca_test, aes(x = PC1, y = PC2, color = PredictedClass, shape = TrueClass)) +
geom_point(size = 3, alpha = 0.7) +
labs(title = "PCA Projection of Test Data (k-NN predictions)",
x = "Principal Component 1", y = "Principal Component 2") +
theme_minimal() +
scale_color_brewer(palette = "Dark2") +
guides(color = guide_legend(title = "Predicted"), shape = guide_legend(title = "True"))
## Warning: The shape palette can deal with a maximum of 6 discrete values because more
## than 6 becomes difficult to discriminate
## ℹ you have requested 7 values. Consider specifying shapes manually if you need
## that many have them.
## Warning: Removed 87 rows containing missing values or values outside the scale range
## (`geom_point()`).
PCA Plot: Considerable mixing of predicted and true labels in overlapping regions, reflecting local neighborhood misclassifications.
# Compare models
results <- data.frame(
Model = c("Logistic Regression", "Random Forest", "XGBoost", "kNN"),
Accuracy = c(
mean(log_pred == test_label),
mean(rf_pred == test_label),
mean(xgb_pred == as.integer(test_label) - 1),  # xgb_pred holds 0-based class indices
mean(knn_pred == test_label)
)
)
print(results)
## Model Accuracy
## 1 Logistic Regression 0.6408228
## 2 Random Forest 0.8512658
## 3 XGBoost 0.8370253
## 4 kNN 0.7452532
Explanation:
This final comparison table consolidates model performance: Random Forest (about 85%) and XGBoost (about 84%) performed best, kNN reached about 75%, and multinomial logistic regression trailed at about 64%, suggesting that non-linear ensemble methods capture the behavioral patterns in this data more effectively.
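A quick visual comparison of the same numbers (a minimal sketch reusing the results data frame defined above):
# Bar chart of test-set accuracy by model
ggplot(results, aes(x = reorder(Model, Accuracy), y = Accuracy)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Test Accuracy by Model", x = "Model", y = "Accuracy") +
  theme_minimal()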
Predictive Power:
With Height, Weight, and BMI deliberately excluded from the models, Age and FCVC showed the strongest separation across obesity levels among the numeric features (largest ANOVA F statistics), while categorical variables such as CAEC, Gender, FAVC, and family_history_with_overweight showed strong associations in the chi-squared tests.
Model Evaluation:
Although the ensemble models reached high accuracy (approximately 84-85%), none of the models achieved perfect classification; most remaining confusion occurs between adjacent categories such as Normal_Weight and the Overweight levels. Further validation using an external dataset is recommended before drawing strong conclusions.
Limitations:
1. Potential self-reporting bias in survey-based features.
2. Evaluation relied on a single train/test split with no hyperparameter tuning, so the reported accuracies may be optimistic.
3. No assessment of generalizability on unseen data (e.g., cross-dataset evaluation).
Recommendations:
1. Introduce regularization (e.g., LASSO, Ridge) for sparser, more interpretable models (see the sketch after this list).
2. Explore dimensionality reduction (e.g., PCA) as a preprocessing step.
3. Use SHAP values or LIME for explainable AI insights.
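A hedged sketch of the first recommendation, fitting a LASSO-penalized multinomial model with glmnet (glmnet is an additional package assumption and is not used elsewhere in this report):
library(glmnet)
# Cross-validated LASSO (alpha = 1) multinomial logistic regression
x_train <- model.matrix(~ . - 1, data = train_data)
x_test  <- model.matrix(~ . - 1, data = test_data)
cv_lasso <- cv.glmnet(x_train, train_label, family = "multinomial",
                      alpha = 1, type.multinomial = "grouped")
lasso_pred <- predict(cv_lasso, newx = x_test, s = "lambda.min", type = "class")
mean(lasso_pred == as.character(test_label))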