Introduction / Background

Obesity is a global public health issue with serious consequences for physical and mental health. Numerous studies suggest that obesity is influenced by behavioral, dietary, and genetic factors, and early prediction from lifestyle variables can support targeted interventions. This study explores obesity prediction using several classification techniques on a dataset of demographic and behavioral health data.

Dataset Description

The dataset consists of individual-level data such as gender, age, height, weight, family history of obesity, eating habits, physical activity, and transportation mode. The target variable is NObeyesdad, a seven-level obesity status (e.g., Normal_Weight, Overweight_Level_I).

Data Loading and Initial Exploration

# Define the URL
url <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vQv7ETe9H0ySeTirE9z67a1X2nGMozFPqYvNxVwXc6-tx3IXX-Ez0LGppCzoFvTLz6b7NQg_vGA1PLA/pub?output=csv"
 
# Read the CSV file
obesity_data <- read.csv(url, stringsAsFactors = FALSE)
 
# View the first few rows
head(obesity_data)
##   Gender Age Height Weight family_history_with_overweight FAVC FCVC NCP
## 1 Female  21   1.62   64.0                            yes   no    2   3
## 2 Female  21   1.52   56.0                            yes   no    3   3
## 3   Male  23   1.80   77.0                            yes   no    2   3
## 4   Male  27   1.80   87.0                             no   no    3   3
## 5   Male  22   1.78   89.8                             no   no    2   1
## 6   Male  29   1.62   53.0                             no  yes    2   3
##        CAEC SMOKE CH2O SCC FAF TUE       CALC                MTRANS
## 1 Sometimes    no    2  no   0   1         no Public_Transportation
## 2 Sometimes   yes    3 yes   3   0  Sometimes Public_Transportation
## 3 Sometimes    no    2  no   2   1 Frequently Public_Transportation
## 4 Sometimes    no    2  no   2   0 Frequently               Walking
## 5 Sometimes    no    2  no   0   0  Sometimes Public_Transportation
## 6 Sometimes    no    2  no   0   0  Sometimes            Automobile
##            NObeyesdad
## 1       Normal_Weight
## 2       Normal_Weight
## 3       Normal_Weight
## 4  Overweight_Level_I
## 5 Overweight_Level_II
## 6       Normal_Weight
# Check structure and summary
str(obesity_data)
## 'data.frame':    2111 obs. of  17 variables:
##  $ Gender                        : chr  "Female" "Female" "Male" "Male" ...
##  $ Age                           : num  21 21 23 27 22 29 23 22 24 22 ...
##  $ Height                        : num  1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 ...
##  $ Weight                        : num  64 56 77 87 89.8 53 55 53 64 68 ...
##  $ family_history_with_overweight: chr  "yes" "yes" "yes" "no" ...
##  $ FAVC                          : chr  "no" "no" "no" "no" ...
##  $ FCVC                          : num  2 3 2 3 2 2 3 2 3 2 ...
##  $ NCP                           : num  3 3 3 3 1 3 3 3 3 3 ...
##  $ CAEC                          : chr  "Sometimes" "Sometimes" "Sometimes" "Sometimes" ...
##  $ SMOKE                         : chr  "no" "yes" "no" "no" ...
##  $ CH2O                          : num  2 3 2 2 2 2 2 2 2 2 ...
##  $ SCC                           : chr  "no" "yes" "no" "no" ...
##  $ FAF                           : num  0 3 2 2 0 0 1 3 1 1 ...
##  $ TUE                           : num  1 0 1 0 0 0 0 0 1 1 ...
##  $ CALC                          : chr  "no" "Sometimes" "Frequently" "Frequently" ...
##  $ MTRANS                        : chr  "Public_Transportation" "Public_Transportation" "Public_Transportation" "Walking" ...
##  $ NObeyesdad                    : chr  "Normal_Weight" "Normal_Weight" "Normal_Weight" "Overweight_Level_I" ...
summary(obesity_data)
##     Gender               Age            Height          Weight      
##  Length:2111        Min.   :14.00   Min.   :1.450   Min.   : 39.00  
##  Class :character   1st Qu.:19.95   1st Qu.:1.630   1st Qu.: 65.47  
##  Mode  :character   Median :22.78   Median :1.700   Median : 83.00  
##                     Mean   :24.31   Mean   :1.702   Mean   : 86.59  
##                     3rd Qu.:26.00   3rd Qu.:1.768   3rd Qu.:107.43  
##                     Max.   :61.00   Max.   :1.980   Max.   :173.00  
##  family_history_with_overweight     FAVC                FCVC      
##  Length:2111                    Length:2111        Min.   :1.000  
##  Class :character               Class :character   1st Qu.:2.000  
##  Mode  :character               Mode  :character   Median :2.386  
##                                                    Mean   :2.419  
##                                                    3rd Qu.:3.000  
##                                                    Max.   :3.000  
##       NCP            CAEC              SMOKE                CH2O      
##  Min.   :1.000   Length:2111        Length:2111        Min.   :1.000  
##  1st Qu.:2.659   Class :character   Class :character   1st Qu.:1.585  
##  Median :3.000   Mode  :character   Mode  :character   Median :2.000  
##  Mean   :2.686                                         Mean   :2.008  
##  3rd Qu.:3.000                                         3rd Qu.:2.477  
##  Max.   :4.000                                         Max.   :3.000  
##      SCC                 FAF              TUE             CALC          
##  Length:2111        Min.   :0.0000   Min.   :0.0000   Length:2111       
##  Class :character   1st Qu.:0.1245   1st Qu.:0.0000   Class :character  
##  Mode  :character   Median :1.0000   Median :0.6253   Mode  :character  
##                     Mean   :1.0103   Mean   :0.6579                     
##                     3rd Qu.:1.6667   3rd Qu.:1.0000                     
##                     Max.   :3.0000   Max.   :2.0000                     
##     MTRANS           NObeyesdad       
##  Length:2111        Length:2111       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 

Explanation:

These commands provide a first overview of the dataset. str() reveals data types and dimensions, confirming 2,111 observations across 17 variables. summary() gives descriptive statistics for the numeric variables; for the character columns it reports only length/class/mode, with frequency counts becoming available once they are converted to factors below. Together these checks establish variable ranges and flag potential anomalies.

Variable Summary and Visualization

obesity_data$BMI <- obesity_data$Weight / (obesity_data$Height^2)

# Overview of BMI by obesity class
ggplot(obesity_data, aes(x = NObeyesdad, y = BMI, fill = Gender)) +
  geom_boxplot() +
  labs(title = "BMI Distribution by Obesity Class", y = "BMI", x = "Obesity Class") +
  theme_minimal()

# Check for missing values
colSums(is.na(obesity_data))
##                         Gender                            Age 
##                              0                              0 
##                         Height                         Weight 
##                              0                              0 
## family_history_with_overweight                           FAVC 
##                              0                              0 
##                           FCVC                            NCP 
##                              0                              0 
##                           CAEC                          SMOKE 
##                              0                              0 
##                           CH2O                            SCC 
##                              0                              0 
##                            FAF                            TUE 
##                              0                              0 
##                           CALC                         MTRANS 
##                              0                              0 
##                     NObeyesdad                            BMI 
##                              0                              0

Feature Engineering: BMI Calculation

# BMI calculation (re-derives the same column created in the visualization step above)
obesity_data$BMI <- obesity_data$Weight / (obesity_data$Height ^ 2)

Explanation:

BMI is calculated using the standard formula: weight in kilograms divided by height in meters squared. BMI is a crucial derived metric in obesity research as it acts as a continuous proxy for weight classification and helps validate categorical obesity labels in the dataset.
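Since BMI underpins the obesity labels, a quick consistency check is possible: bucket BMI at the standard WHO cut-points (18.5/25/30/35/40) and cross-tabulate against NObeyesdad. A minimal sketch; the bucket labels are illustrative:

# Cross-tabulate WHO-style BMI buckets against the dataset's labels (sketch)
bmi_class <- cut(obesity_data$BMI,
                 breaks = c(-Inf, 18.5, 25, 30, 35, 40, Inf),
                 labels = c("Underweight", "Normal", "Overweight",
                            "Obese_I", "Obese_II", "Obese_III"))
table(BMI_bucket = bmi_class, Label = obesity_data$NObeyesdad)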

Encoding Categorical Variables

# Encode categorical variables as factors
cat_cols <- c("Gender", "family_history_with_overweight", "FAVC", "CAEC", 
              "SMOKE", "SCC", "CALC", "MTRANS", "NObeyesdad")
obesity_data[cat_cols] <- lapply(obesity_data[cat_cols], factor)

Explanation:

To prepare the data for modeling, categorical features are explicitly converted to factor type. This ensures that modeling functions (e.g., logistic regression, random forest) treat them as discrete entities rather than continuous variables.
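A quick sanity check after the conversion; nlevels() confirms how many categories each factor has, and table() shows the balance of the seven target classes:

sapply(obesity_data[cat_cols], nlevels)  # levels per categorical column
table(obesity_data$NObeyesdad)           # class distribution of the target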

Outlier Detection with Boxplots

# Outlier detection via boxplots
numeric_cols <- c("Age", "FCVC", "NCP", "CH2O", "FAF", "TUE")
op <- par(mfrow = c(2, 3))  # 2 x 3 grid fits the six variables
for (col in numeric_cols) {
  boxplot(obesity_data[[col]], main = col)
}
par(op)  # restore previous graphics settings

Explanation:

Boxplots are used to visualize the distribution and detect potential outliers within numeric variables. Identifying these is crucial as extreme values can distort model training, particularly in algorithms sensitive to distance metrics (e.g., kNN).
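To quantify what the boxplots show, a small sketch applies the conventional 1.5 x IQR whisker rule and counts potential outliers per variable:

# Count values beyond the 1.5 * IQR whiskers for each numeric column (sketch)
sapply(obesity_data[numeric_cols], function(x) {
  q <- quantile(x, c(0.25, 0.75))
  iqr <- diff(q)
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)
})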

# Remove Height, Weight, and BMI: they determine the obesity label almost
# directly, so keeping them would let the models bypass the lifestyle signal
obesity_data <- obesity_data %>% select(-c(Height, Weight, BMI))

Data Partitioning

# Split dataset
set.seed(78)
splitIndex <- createDataPartition(obesity_data$NObeyesdad, p = 0.7, list = FALSE)
train_data <- obesity_data[splitIndex, ]
test_data <- obesity_data[-splitIndex, ]
train_label <- train_data$NObeyesdad
test_label <- test_data$NObeyesdad
train_data$NObeyesdad <- NULL
test_data$NObeyesdad <- NULL

Explanation:

The dataset is split into training (70%) and testing (30%) sets while preserving class proportions (createDataPartition). This stratified sampling helps avoid bias in model evaluation, especially with a multiclass target.
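A quick check (sketch) that the stratified split worked as intended; the class proportions in the two partitions should be nearly identical:

# Compare class proportions across the 70/30 partitions
round(rbind(train = prop.table(table(train_label)),
            test  = prop.table(table(test_label))), 3)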


2. Exploratory Data Analysis (EDA)

# Summary statistics
summary(obesity_data[numeric_cols])
##       Age             FCVC            NCP             CH2O      
##  Min.   :14.00   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:19.95   1st Qu.:2.000   1st Qu.:2.659   1st Qu.:1.585  
##  Median :22.78   Median :2.386   Median :3.000   Median :2.000  
##  Mean   :24.31   Mean   :2.419   Mean   :2.686   Mean   :2.008  
##  3rd Qu.:26.00   3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:2.477  
##  Max.   :61.00   Max.   :3.000   Max.   :4.000   Max.   :3.000  
##       FAF              TUE        
##  Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.1245   1st Qu.:0.0000  
##  Median :1.0000   Median :0.6253  
##  Mean   :1.0103   Mean   :0.6579  
##  3rd Qu.:1.6667   3rd Qu.:1.0000  
##  Max.   :3.0000   Max.   :2.0000
# Correlation heatmap
cor_mat <- cor(obesity_data[numeric_cols])
corrplot(cor_mat, method = "color", tl.cex = 0.7)

Numerical Summary & Correlation Heatmap

The summary() and cor() functions assess central tendency and relationships among numerical predictors. A heatmap visually illustrates pairwise correlations, aiding feature selection and multicollinearity checks.
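As a follow-up sketch, strongly correlated pairs can be listed programmatically; the |r| > 0.7 cutoff here is an arbitrary rule of thumb:

# List predictor pairs whose absolute correlation exceeds 0.7 (sketch)
high <- which(abs(cor_mat) > 0.7 & upper.tri(cor_mat), arr.ind = TRUE)
if (nrow(high) > 0) {
  data.frame(var1 = rownames(cor_mat)[high[, 1]],
             var2 = colnames(cor_mat)[high[, 2]],
             r    = round(cor_mat[high], 2))
} else {
  "No pairs exceed |r| = 0.7"
}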

# Boxplots by obesity level (tidy evaluation instead of the deprecated aes_string())
for (col in numeric_cols) {
  p <- ggplot(obesity_data, aes(x = NObeyesdad, y = .data[[col]])) +
    geom_boxplot(fill = "steelblue") +
    theme_minimal() +
    ggtitle(paste("Boxplot of", col, "by Obesity Level")) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
  
  print(p)  # ggplot objects must be printed explicitly inside a loop
}

Boxplots by Obesity Class

Multiple ggplot2::geom_boxplot() calls visualize how each numeric feature varies across the obesity classes, helping identify discriminative features (e.g., Age or FCVC) that separate categories. Note that BMI, Height, and Weight were dropped earlier, so only lifestyle and demographic features appear here.

# Bar plot of categorical variable
ggplot(obesity_data, aes(x = MTRANS, fill = NObeyesdad)) +
  geom_bar(position = "fill") +
  theme_minimal() +
  ylab("Proportion") +
  ggtitle("Transportation Mode vs obesity_data")


3. Statistical Analysis

ANOVA (for Numeric Features)

# ANOVA: test each continuous variable against obesity level
# (tables below print in numeric_cols order: Age, FCVC, NCP, CH2O, FAF, TUE)
for (col in numeric_cols) {
  print(anova(aov(obesity_data[[col]] ~ obesity_data$NObeyesdad)))
}
## Analysis of Variance Table
## 
## Response: obesity_data[[col]]
##                           Df Sum Sq Mean Sq F value    Pr(>F)    
## obesity_data$NObeyesdad    6  15454 2575.69  77.954 < 2.2e-16 ***
## Residuals               2104  69518   33.04                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Response: obesity_data[[col]]
##                           Df Sum Sq Mean Sq F value    Pr(>F)    
## obesity_data$NObeyesdad    6 145.92 24.3203  112.32 < 2.2e-16 ***
## Residuals               2104 455.59  0.2165                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Response: obesity_data[[col]]
##                           Df  Sum Sq Mean Sq F value    Pr(>F)    
## obesity_data$NObeyesdad    6   90.72  15.120  26.812 < 2.2e-16 ***
## Residuals               2104 1186.55   0.564                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Response: obesity_data[[col]]
##                           Df Sum Sq Mean Sq F value    Pr(>F)    
## obesity_data$NObeyesdad    6  34.95  5.8244  16.171 < 2.2e-16 ***
## Residuals               2104 757.81  0.3602                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Response: obesity_data[[col]]
##                           Df Sum Sq Mean Sq F value    Pr(>F)    
## obesity_data$NObeyesdad    6   72.5 12.0835  17.484 < 2.2e-16 ***
## Residuals               2104 1454.1  0.6911                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Response: obesity_data[[col]]
##                           Df Sum Sq Mean Sq F value    Pr(>F)    
## obesity_data$NObeyesdad    6  17.19 2.86459  7.8767 2.069e-08 ***
## Residuals               2104 765.18 0.36368                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Explanation:

ANOVA tests whether mean values of numeric features significantly differ across obesity categories. The consistently low p-values (p < 0.001) indicate strong associations between most numeric predictors and obesity level.
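A significant F-test says the group means differ somewhere, but not where. A hedged follow-up sketch using Tukey's HSD on one variable (Age) identifies the specific class pairs that differ:

# Pairwise post-hoc comparisons of mean Age across obesity classes (sketch)
TukeyHSD(aov(Age ~ NObeyesdad, data = obesity_data))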

Chi-Squared Test (for Categorical Features)

# Chi-squared tests for each categorical predictor (target excluded);
# results print in cat_cols order: Gender, family_history_with_overweight,
# FAVC, CAEC, SMOKE, SCC, CALC, MTRANS
for (cat in cat_cols[-length(cat_cols)]) {
  tbl <- table(obesity_data[[cat]], obesity_data$NObeyesdad)
  print(chisq.test(tbl))
}
## 
##  Pearson's Chi-squared test
## 
## data:  tbl
## X-squared = 657.75, df = 6, p-value < 2.2e-16
## 
## 
##  Pearson's Chi-squared test
## 
## data:  tbl
## X-squared = 621.98, df = 6, p-value < 2.2e-16
## 
## 
##  Pearson's Chi-squared test
## 
## data:  tbl
## X-squared = 233.34, df = 6, p-value < 2.2e-16
## 
## 
##  Pearson's Chi-squared test
## 
## data:  tbl
## X-squared = 802.98, df = 18, p-value < 2.2e-16
## 
## 
##  Pearson's Chi-squared test
## 
## data:  tbl
## X-squared = 32.138, df = 6, p-value = 1.535e-05
## 
## 
##  Pearson's Chi-squared test
## 
## data:  tbl
## X-squared = 123.02, df = 6, p-value < 2.2e-16
## Warning in chisq.test(tbl): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  tbl
## X-squared = 338.58, df = 18, p-value < 2.2e-16
## Warning in chisq.test(tbl): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  tbl
## X-squared = 292.59, df = 24, p-value < 2.2e-16

Explanation:
The chi-squared test evaluates independence between each categorical predictor and the obesity outcome. Strong statistical significance suggests features like FAVC, family_history_with_overweight, and MTRANS are meaningful predictors. The two "Chi-squared approximation may be incorrect" warnings indicate sparse contingency cells (they appear alongside the CALC and MTRANS tests); Fisher's exact test or merging rare categories would make those particular results more trustworthy.
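Because nearly every test is significant at this sample size, effect sizes are more informative for ranking. A sketch computing Cramér's V with a small hypothetical helper (not part of the original analysis):

# Cramér's V rescales chi-squared to a 0-1 association strength (sketch)
cramers_v <- function(tbl) {
  chi2 <- suppressWarnings(chisq.test(tbl)$statistic)
  as.numeric(sqrt(chi2 / (sum(tbl) * (min(dim(tbl)) - 1))))
}
sapply(cat_cols[-length(cat_cols)], function(cat)
  cramers_v(table(obesity_data[[cat]], obesity_data$NObeyesdad)))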


4. Machine Learning Models

ctrl <- trainControl(method = "cv", number = 5)  # 5-fold CV setup (defined for caret-based tuning; not used by the models below)

Logistic Regression (Multinomial):

# Logistic Regression (multinomial)
log_model <- multinom(NObeyesdad ~ ., data = cbind(train_data, NObeyesdad=train_label))
## # weights:  161 (132 variable)
## initial  value 2878.001110 
## iter  10 value 2140.370039
## iter  20 value 1765.883309
## iter  30 value 1609.788843
## iter  40 value 1555.349780
## iter  50 value 1522.454476
## iter  60 value 1511.590537
## iter  70 value 1503.056446
## iter  80 value 1492.994820
## iter  90 value 1486.012993
## iter 100 value 1478.630888
## final  value 1478.630888 
## stopped after 100 iterations
log_pred <- predict(log_model, test_data)
confusionMatrix(log_pred, test_label)
## Confusion Matrix and Statistics
## 
##                      Reference
## Prediction            Insufficient_Weight Normal_Weight Obesity_Type_I
##   Insufficient_Weight                  62            28              7
##   Normal_Weight                         8            31              1
##   Obesity_Type_I                        3             5             73
##   Obesity_Type_II                       0             4             12
##   Obesity_Type_III                      0             1              0
##   Overweight_Level_I                    7            10              5
##   Overweight_Level_II                   1             7              7
##                      Reference
## Prediction            Obesity_Type_II Obesity_Type_III Overweight_Level_I
##   Insufficient_Weight               1                1                  4
##   Normal_Weight                     0                0                  3
##   Obesity_Type_I                   11                0                 25
##   Obesity_Type_II                  75                0                 10
##   Obesity_Type_III                  0               96                  1
##   Overweight_Level_I                0                0                 40
##   Overweight_Level_II               2                0                  4
##                      Reference
## Prediction            Overweight_Level_II
##   Insufficient_Weight                   3
##   Normal_Weight                         6
##   Obesity_Type_I                       26
##   Obesity_Type_II                      18
##   Obesity_Type_III                      1
##   Overweight_Level_I                    5
##   Overweight_Level_II                  28
## 
## Overall Statistics
##                                          
##                Accuracy : 0.6408         
##                  95% CI : (0.602, 0.6783)
##     No Information Rate : 0.1661         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.5797         
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: Insufficient_Weight Class: Normal_Weight
## Sensitivity                              0.7654              0.36047
## Specificity                              0.9201              0.96703
## Pos Pred Value                           0.5849              0.63265
## Neg Pred Value                           0.9639              0.90566
## Prevalence                               0.1282              0.13608
## Detection Rate                           0.0981              0.04905
## Detection Prevalence                     0.1677              0.07753
## Balanced Accuracy                        0.8428              0.66375
##                      Class: Obesity_Type_I Class: Obesity_Type_II
## Sensitivity                         0.6952                 0.8427
## Specificity                         0.8672                 0.9190
## Pos Pred Value                      0.5105                 0.6303
## Neg Pred Value                      0.9346                 0.9727
## Prevalence                          0.1661                 0.1408
## Detection Rate                      0.1155                 0.1187
## Detection Prevalence                0.2263                 0.1883
## Balanced Accuracy                   0.7812                 0.8808
##                      Class: Obesity_Type_III Class: Overweight_Level_I
## Sensitivity                           0.9897                   0.45977
## Specificity                           0.9944                   0.95046
## Pos Pred Value                        0.9697                   0.59701
## Neg Pred Value                        0.9981                   0.91681
## Prevalence                            0.1535                   0.13766
## Detection Rate                        0.1519                   0.06329
## Detection Prevalence                  0.1566                   0.10601
## Balanced Accuracy                     0.9920                   0.70511
##                      Class: Overweight_Level_II
## Sensitivity                             0.32184
## Specificity                             0.96147
## Pos Pred Value                          0.57143
## Neg Pred Value                          0.89880
## Prevalence                              0.13766
## Detection Rate                          0.04430
## Detection Prevalence                    0.07753
## Balanced Accuracy                       0.64165

Explanation:

A baseline linear model for multiclass classification, fit with nnet::multinom. It reached 64.1% test accuracy, the weakest of the four models; the iteration trace above also shows the optimizer stopped at the default cap of 100 iterations before converging, so the fit can likely be improved (see the sketch below).
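Since the trace reports "stopped after 100 iterations", a minimal sketch of a longer fit; maxit is a standard nnet argument forwarded by multinom, and 500 is an arbitrary choice:

# Allow the optimizer more iterations so it can converge (sketch)
log_model_conv <- multinom(NObeyesdad ~ .,
                           data = cbind(train_data, NObeyesdad = train_label),
                           maxit = 500, trace = FALSE)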

# Get coefficients and reshape. summary(multinom)$coefficients has one row per
# outcome class and one column per feature, so transpose before melting
# (melt() is from reshape2).
coefs <- summary(log_model)$coefficients
coefs_df <- as.data.frame(t(coefs))
coefs_df$Feature <- rownames(coefs_df)
coefs_long <- melt(coefs_df, id.vars = "Feature", variable.name = "Class", value.name = "Coefficient")

# Keep top 5 features per class by absolute coefficient
top_features <- coefs_long %>%
  group_by(Class) %>%
  top_n(5, wt = abs(Coefficient)) %>%
  ungroup()

# Plot (the six non-reference classes fit within the Set1 palette)
ggplot(top_features, aes(x = reorder(Feature, abs(Coefficient)), y = Coefficient, fill = Class)) +
  geom_bar(stat = "identity", position = "dodge") +
  coord_flip() +
  theme_minimal(base_size = 10) +
  labs(title = "Top Influential Logistic Regression Coefficients",
       y = "Coefficient", x = "Feature") +
  scale_fill_brewer(palette = "Set1")

This bar plot displays the coefficients for each feature across all classes. Positive coefficients increase the log-odds of the class, while negative coefficients reduce it. The visualization helps interpret which features push predictions toward or away from specific obesity classes.

Confusion Matrix:

log_cm <- confusionMatrix(log_pred, test_label)
cm_df_log <- as.data.frame(log_cm$table)
colnames(cm_df_log) <- c("Predicted", "Actual", "Freq")

# Plot Confusion Matrix Heatmap
ggplot(cm_df_log, aes(x = Actual, y = Predicted, fill = Freq)) +
  geom_tile(color = "white") +
  geom_text(aes(label = Freq), color = "black", size = 5) +
  scale_fill_gradient(low = "lightyellow", high = "darkred") +
  labs(title = "Logistic Regression Confusion Matrix",
       x = "Actual Class", y = "Predicted Class") +
  theme_minimal()

Confusion Matrix: Shows moderate performance, with most predictions aligned along the diagonal. Some confusion between similar obesity levels indicates sensitivity to overlapping class boundaries.

PCA Plot:

# ---- PCA for Test Data ----
# Encode all features numerically using model.matrix
log_train_matrix <- model.matrix(~ . - 1, data = train_data)
log_test_matrix <- model.matrix(~ . - 1, data = test_data)

# Combine for consistent PCA
log_combined <- rbind(log_train_matrix, log_test_matrix)
pca_log <- prcomp(log_combined, scale. = TRUE)

# Extract PCA of test data
n_train <- nrow(log_train_matrix)
pca_test_log <- as.data.frame(pca_log$x[(n_train + 1):nrow(pca_log$x), 1:2])
pca_test_log$TrueClass <- test_label
pca_test_log$PredictedClass <- log_pred

# Plot PCA Projection (explicit shapes: the default palette stops at 6, but there are 7 classes)
ggplot(pca_test_log, aes(x = PC1, y = PC2, color = PredictedClass, shape = TrueClass)) +
  geom_point(size = 3, alpha = 0.75) +
  scale_shape_manual(values = 0:6) +
  labs(title = "Logistic Regression Predictions in PCA Space",
       x = "Principal Component 1", y = "Principal Component 2") +
  theme_minimal() +
  scale_color_brewer(palette = "Set1") +
  guides(color = guide_legend(title = "Predicted Class"),
         shape = guide_legend(title = "True Class"))

PCA Plot: Moderate class separation. Some predicted classes overlap with true classes, reflecting logistic regression’s linear decision boundaries.

Random Forest:

# Random Forest
rf_model <- randomForest(x = train_data, y = train_label)
rf_pred <- predict(rf_model, test_data)
confusionMatrix(rf_pred, test_label)
## Confusion Matrix and Statistics
## 
##                      Reference
## Prediction            Insufficient_Weight Normal_Weight Obesity_Type_I
##   Insufficient_Weight                  73             5              1
##   Normal_Weight                         7            66              7
##   Obesity_Type_I                        1             3             90
##   Obesity_Type_II                       0             1              2
##   Obesity_Type_III                      0             1              0
##   Overweight_Level_I                    0             6              3
##   Overweight_Level_II                   0             4              2
##                      Reference
## Prediction            Obesity_Type_II Obesity_Type_III Overweight_Level_I
##   Insufficient_Weight               0                0                  1
##   Normal_Weight                     1                1                  6
##   Obesity_Type_I                    0                0                 11
##   Obesity_Type_II                  84                0                  1
##   Obesity_Type_III                  0               96                  1
##   Overweight_Level_I                0                0                 64
##   Overweight_Level_II               4                0                  3
##                      Reference
## Prediction            Overweight_Level_II
##   Insufficient_Weight                   1
##   Normal_Weight                         7
##   Obesity_Type_I                        8
##   Obesity_Type_II                       4
##   Obesity_Type_III                      0
##   Overweight_Level_I                    2
##   Overweight_Level_II                  65
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8513          
##                  95% CI : (0.8211, 0.8781)
##     No Information Rate : 0.1661          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8262          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Insufficient_Weight Class: Normal_Weight
## Sensitivity                              0.9012               0.7674
## Specificity                              0.9855               0.9469
## Pos Pred Value                           0.9012               0.6947
## Neg Pred Value                           0.9855               0.9628
## Prevalence                               0.1282               0.1361
## Detection Rate                           0.1155               0.1044
## Detection Prevalence                     0.1282               0.1503
## Balanced Accuracy                        0.9434               0.8572
##                      Class: Obesity_Type_I Class: Obesity_Type_II
## Sensitivity                         0.8571                 0.9438
## Specificity                         0.9564                 0.9853
## Pos Pred Value                      0.7965                 0.9130
## Neg Pred Value                      0.9711                 0.9907
## Prevalence                          0.1661                 0.1408
## Detection Rate                      0.1424                 0.1329
## Detection Prevalence                0.1788                 0.1456
## Balanced Accuracy                   0.9067                 0.9645
##                      Class: Obesity_Type_III Class: Overweight_Level_I
## Sensitivity                           0.9897                    0.7356
## Specificity                           0.9963                    0.9798
## Pos Pred Value                        0.9796                    0.8533
## Neg Pred Value                        0.9981                    0.9587
## Prevalence                            0.1535                    0.1377
## Detection Rate                        0.1519                    0.1013
## Detection Prevalence                  0.1551                    0.1187
## Balanced Accuracy                     0.9930                    0.8577
##                      Class: Overweight_Level_II
## Sensitivity                              0.7471
## Specificity                              0.9761
## Pos Pred Value                           0.8333
## Neg Pred Value                           0.9603
## Prevalence                               0.1377
## Detection Rate                           0.1028
## Detection Prevalence                     0.1234
## Balanced Accuracy                        0.8616
varImpPlot(rf_model)

Explanation:

An ensemble of decision trees built with bootstrap aggregation and random feature selection. It achieved the best test accuracy of the four models (85.1%). varImpPlot() ranks features by mean decrease in Gini impurity (see the tuning sketch below for a possible refinement).
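The forest above uses default settings; a hedged sketch with randomForest::tuneRF searches over mtry, the number of features tried at each split (the stepFactor and improve values are arbitrary choices):

set.seed(78)
tuned <- tuneRF(x = train_data, y = train_label,
                stepFactor = 1.5, improve = 0.01, trace = FALSE)
# tuned is a matrix of (mtry, OOB error) pairs; pick the mtry with lowest error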

# OOB error rate as a function of the number of trees
plot(rf_model, main = "Random Forest Error vs. Trees")

# MDS plot (only works if proximity=TRUE in model)
rf_model_mds <- randomForest(x = train_data, y = train_label, proximity = TRUE)
MDSplot(rf_model_mds, train_label)

The MDS plot visualizes how the Random Forest groups similar observations based on proximity (how often they fall in the same leaf across trees). Points closer together are considered more similar. It reveals natural clusters and potential class overlap.

Confusion Matrix:

# Confusion matrix
rf_cm <- confusionMatrix(rf_pred, test_label)
cm_df_rf <- as.data.frame(rf_cm$table)
colnames(cm_df_rf) <- c("Predicted", "Actual", "Freq")

# Plot confusion matrix heatmap
ggplot(cm_df_rf, aes(x = Actual, y = Predicted, fill = Freq)) +
  geom_tile(color = "white") +
  geom_text(aes(label = Freq), color = "black", size = 5) +
  scale_fill_gradient(low = "lightyellow", high = "darkblue") +
  labs(title = "Random Forest Confusion Matrix",
       x = "Actual Class", y = "Predicted Class") +
  theme_minimal()

Confusion Matrix: Strong diagonal presence with fewer off-diagonal errors, indicating high accuracy and robust handling of class variance.

PCA Plot:

# Ensure numeric inputs for PCA
rf_train_matrix <- model.matrix(~ . - 1, data = train_data)
rf_test_matrix <- model.matrix(~ . - 1, data = test_data)

# Combine train and test for consistent PCA
rf_combined <- rbind(rf_train_matrix, rf_test_matrix)

# Perform PCA on numeric matrix
pca_rf <- prcomp(rf_combined, scale. = TRUE)

# Extract test set PCA results
n_train <- nrow(rf_train_matrix)
pca_test_rf <- as.data.frame(pca_rf$x[(n_train + 1):nrow(pca_rf$x), 1:2])
pca_test_rf$TrueClass <- test_label
pca_test_rf$PredictedClass <- rf_pred

# Plot (explicit shapes so all 7 true classes are drawn; ggplot2 is already loaded)
ggplot(pca_test_rf, aes(x = PC1, y = PC2, color = PredictedClass, shape = TrueClass)) +
  geom_point(size = 3, alpha = 0.7) +
  scale_shape_manual(values = 0:6) +
  labs(title = "Random Forest Predictions in PCA Space",
       x = "Principal Component 1", y = "Principal Component 2") +
  theme_minimal() +
  scale_color_brewer(palette = "Dark2") +
  guides(color = guide_legend(title = "Predicted"), shape = guide_legend(title = "True"))

PCA Plot: Good separation between predicted and actual classes. Most points match in color and shape, showing effective classification in reduced space.

XGBoost:

# XGBoost - Encode factors with model.matrix
xgb_train_matrix <- model.matrix(~ . - 1, data = train_data)
xgb_test_matrix <- model.matrix(~ . - 1, data = test_data)

xgb_train <- xgb.DMatrix(data = xgb_train_matrix, label = as.integer(train_label) - 1)
xgb_test <- xgb.DMatrix(data = xgb_test_matrix)

xgb_model <- xgboost(
  data = xgb_train,
  max.depth = 6,
  nrounds = 100,
  objective = "multi:softmax",
  num_class = length(levels(train_label)),
  verbose = 0
)

xgb_pred <- predict(xgb_model, xgb_test)
# Map the integer predictions (0-6) back onto the class labels; this relies on
# every class appearing at least once among the predictions
confusionMatrix(factor(xgb_pred, labels = levels(train_label)), test_label)
## Confusion Matrix and Statistics
## 
##                      Reference
## Prediction            Insufficient_Weight Normal_Weight Obesity_Type_I
##   Insufficient_Weight                  70             8              2
##   Normal_Weight                         6            61              5
##   Obesity_Type_I                        1             5             88
##   Obesity_Type_II                       1             0              2
##   Obesity_Type_III                      1             0              0
##   Overweight_Level_I                    2             7              3
##   Overweight_Level_II                   0             5              5
##                      Reference
## Prediction            Obesity_Type_II Obesity_Type_III Overweight_Level_I
##   Insufficient_Weight               0                0                  1
##   Normal_Weight                     1                1                  4
##   Obesity_Type_I                    0                0                 14
##   Obesity_Type_II                  84                0                  1
##   Obesity_Type_III                  0               96                  0
##   Overweight_Level_I                0                0                 61
##   Overweight_Level_II               4                0                  6
##                      Reference
## Prediction            Overweight_Level_II
##   Insufficient_Weight                   1
##   Normal_Weight                         5
##   Obesity_Type_I                        7
##   Obesity_Type_II                       2
##   Obesity_Type_III                      0
##   Overweight_Level_I                    3
##   Overweight_Level_II                  69
## 
## Overall Statistics
##                                          
##                Accuracy : 0.837          
##                  95% CI : (0.8059, 0.865)
##     No Information Rate : 0.1661         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.8095         
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: Insufficient_Weight Class: Normal_Weight
## Sensitivity                              0.8642              0.70930
## Specificity                              0.9782              0.95971
## Pos Pred Value                           0.8537              0.73494
## Neg Pred Value                           0.9800              0.95446
## Prevalence                               0.1282              0.13608
## Detection Rate                           0.1108              0.09652
## Detection Prevalence                     0.1297              0.13133
## Balanced Accuracy                        0.9212              0.83450
##                      Class: Obesity_Type_I Class: Obesity_Type_II
## Sensitivity                         0.8381                 0.9438
## Specificity                         0.9488                 0.9890
## Pos Pred Value                      0.7652                 0.9333
## Neg Pred Value                      0.9671                 0.9908
## Prevalence                          0.1661                 0.1408
## Detection Rate                      0.1392                 0.1329
## Detection Prevalence                0.1820                 0.1424
## Balanced Accuracy                   0.8934                 0.9664
##                      Class: Obesity_Type_III Class: Overweight_Level_I
## Sensitivity                           0.9897                   0.70115
## Specificity                           0.9981                   0.97248
## Pos Pred Value                        0.9897                   0.80263
## Neg Pred Value                        0.9981                   0.95324
## Prevalence                            0.1535                   0.13766
## Detection Rate                        0.1519                   0.09652
## Detection Prevalence                  0.1535                   0.12025
## Balanced Accuracy                     0.9939                   0.83681
##                      Class: Overweight_Level_II
## Sensitivity                              0.7931
## Specificity                              0.9633
## Pos Pred Value                           0.7753
## Neg Pred Value                           0.9669
## Prevalence                               0.1377
## Detection Rate                           0.1092
## Detection Prevalence                     0.1408
## Balanced Accuracy                        0.8782

Explanation:

A gradient boosting framework trained with a softmax objective for multiclass output. It reached 83.7% test accuracy, essentially on par with the random forest; the number of boosting rounds was fixed at 100 rather than tuned (see the cross-validation sketch below).
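A sketch with xgb.cv and early stopping picks a data-driven round count instead of the fixed nrounds = 100 (argument names follow the xgboost R package; 200/10 are arbitrary choices):

set.seed(78)
cv <- xgb.cv(data = xgb_train, nrounds = 200, nfold = 5,
             params = list(objective = "multi:softmax",
                           num_class = length(levels(train_label)),
                           max_depth = 6),
             early_stopping_rounds = 10, verbose = 0)
cv$best_iteration  # round count with the best mean CV error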

# Plot the first tree (index = 0)
xgb.plot.tree(model = xgb_model, trees = 0, show_node_id = TRUE)

Explanation:

This renders one boosted tree from the ensemble; with a multi:softmax objective, tree index 0 is the first tree built for the first class. It shows how the model splits on features at each depth.
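For a whole-model view rather than a single tree, gain-based importance can be aggregated across all boosting rounds; a short sketch using the xgboost helpers:

# Gain/Cover/Frequency per feature, summed over every tree in the ensemble
imp <- xgb.importance(model = xgb_model)
xgb.plot.importance(imp, top_n = 10)  # bar chart of the 10 most important features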

Confusion Matrix:

# Create confusion matrix
xgb_cm <- confusionMatrix(factor(xgb_pred, labels = levels(train_label)), test_label)
cm_df <- as.data.frame(xgb_cm$table)
colnames(cm_df) <- c("Predicted", "Actual", "Freq")

# Plot confusion matrix as heatmap
ggplot(cm_df, aes(x = Actual, y = Predicted, fill = Freq)) +
  geom_tile(color = "white") +
  geom_text(aes(label = Freq), color = "black", size = 5) +
  scale_fill_gradient(low = "lightyellow", high = "darkred") +
  labs(title = "XGBoost Confusion Matrix",
       x = "Actual Class", y = "Predicted Class") +
  theme_minimal()

Confusion Matrix: High accuracy with sharper diagonal concentration, especially in difficult-to-separate classes. Few misclassifications.

PCA Plot:

# Combine both train and test for consistent PCA transformation
combined <- rbind(xgb_train_matrix, xgb_test_matrix)
pca <- prcomp(combined, scale. = TRUE)

# Extract PCA components for test data
pca_test <- as.data.frame(pca$x[(nrow(xgb_train_matrix) + 1):nrow(pca$x), 1:2])
pca_test$TrueClass <- test_label
pca_test$PredictedClass <- factor(xgb_pred, labels = levels(train_label))

# Plot PCA projection (explicit shapes so all 7 true classes are drawn)
ggplot(pca_test, aes(x = PC1, y = PC2, color = PredictedClass, shape = TrueClass)) +
  geom_point(size = 3, alpha = 0.7) +
  scale_shape_manual(values = 0:6) +
  labs(title = "XGBoost Predictions in PCA Space",
       x = "Principal Component 1", y = "Principal Component 2") +
  theme_minimal() +
  scale_color_brewer(palette = "Set1") +
  guides(color = guide_legend(title = "Predicted"), shape = guide_legend(title = "True"))

PCA Plot: Clear class clusters with minimal overlap. Predicted labels closely match actual classes, showing strong model generalization.

k-Nearest Neighbors (k = 5):

# kNN - Requires numeric input
knn_train <- model.matrix(~ . - 1, data = train_data)
knn_test <- model.matrix(~ . - 1, data = test_data)

knn_pred <- knn(train = knn_train, test = knn_test, cl = train_label, k = 5)
confusionMatrix(knn_pred, test_label)
## Confusion Matrix and Statistics
## 
##                      Reference
## Prediction            Insufficient_Weight Normal_Weight Obesity_Type_I
##   Insufficient_Weight                  72            19              4
##   Normal_Weight                         2            21              1
##   Obesity_Type_I                        2            16             86
##   Obesity_Type_II                       0             5              4
##   Obesity_Type_III                      1             8              1
##   Overweight_Level_I                    2             9              1
##   Overweight_Level_II                   2             8              8
##                      Reference
## Prediction            Obesity_Type_II Obesity_Type_III Overweight_Level_I
##   Insufficient_Weight               0                0                  1
##   Normal_Weight                     1                1                  3
##   Obesity_Type_I                    0                1                 10
##   Obesity_Type_II                  88                0                  2
##   Obesity_Type_III                  0               95                 10
##   Overweight_Level_I                0                0                 58
##   Overweight_Level_II               0                0                  3
##                      Reference
## Prediction            Overweight_Level_II
##   Insufficient_Weight                   3
##   Normal_Weight                         1
##   Obesity_Type_I                       12
##   Obesity_Type_II                       9
##   Obesity_Type_III                      2
##   Overweight_Level_I                    5
##   Overweight_Level_II                  55
## 
## Overall Statistics
##                                          
##                Accuracy : 0.7516         
##                  95% CI : (0.716, 0.7848)
##     No Information Rate : 0.1661         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.7094         
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: Insufficient_Weight Class: Normal_Weight
## Sensitivity                              0.8889              0.24419
## Specificity                              0.9510              0.98352
## Pos Pred Value                           0.7273              0.70000
## Neg Pred Value                           0.9831              0.89203
## Prevalence                               0.1282              0.13608
## Detection Rate                           0.1139              0.03323
## Detection Prevalence                     0.1566              0.04747
## Balanced Accuracy                        0.9199              0.61385
##                      Class: Obesity_Type_I Class: Obesity_Type_II
## Sensitivity                         0.8190                 0.9888
## Specificity                         0.9222                 0.9632
## Pos Pred Value                      0.6772                 0.8148
## Neg Pred Value                      0.9624                 0.9981
## Prevalence                          0.1661                 0.1408
## Detection Rate                      0.1361                 0.1392
## Detection Prevalence                0.2009                 0.1709
## Balanced Accuracy                   0.8706                 0.9760
##                      Class: Obesity_Type_III Class: Overweight_Level_I
## Sensitivity                           0.9794                   0.66667
## Specificity                           0.9589                   0.96881
## Pos Pred Value                        0.8120                   0.77333
## Neg Pred Value                        0.9961                   0.94794
## Prevalence                            0.1535                   0.13766
## Detection Rate                        0.1503                   0.09177
## Detection Prevalence                  0.1851                   0.11867
## Balanced Accuracy                     0.9691                   0.81774
##                      Class: Overweight_Level_II
## Sensitivity                             0.63218
## Specificity                             0.96147
## Pos Pred Value                          0.72368
## Neg Pred Value                          0.94245
## Prevalence                              0.13766
## Detection Rate                          0.08703
## Detection Prevalence                    0.12025
## Balanced Accuracy                       0.79683

Explanation:

A distance-based classifier that assigns the majority class among the k nearest neighbors. It achieved noticeably lower accuracy (about 75%), reflecting sensitivity to overlapping class boundaries and to the unscaled feature space, where distances are dominated by wide-range variables such as Age (see the scaling sketch below).
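A minimal sketch of that scaling fix (assumes no zero-variance columns): standardize the training matrix, apply the same center and scale to the test matrix, then rerun knn().

# Standardize features before kNN so no single variable dominates distances
knn_train_sc <- scale(knn_train)
knn_test_sc  <- scale(knn_test,
                      center = attr(knn_train_sc, "scaled:center"),
                      scale  = attr(knn_train_sc, "scaled:scale"))
knn_pred_sc  <- knn(train = knn_train_sc, test = knn_test_sc,
                    cl = train_label, k = 5)
mean(knn_pred_sc == test_label)  # accuracy with standardized features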

Confusion Matrix:

# Re-run kNN (note: knn() resolves voting ties at random, so repeated runs can
# differ slightly; this explains the small accuracy gap in the Results table)
knn_train <- model.matrix(~ . - 1, data = train_data)
knn_test <- model.matrix(~ . - 1, data = test_data)
knn_pred <- knn(train = knn_train, test = knn_test, cl = train_label, k = 5)
cm <- confusionMatrix(knn_pred, test_label)

# Convert to data frame for ggplot
cm_df <- as.data.frame(cm$table)
colnames(cm_df) <- c("Predicted", "Actual", "Freq")

# Heatmap plot
ggplot(cm_df, aes(x = Actual, y = Predicted, fill = Freq)) +
  geom_tile(color = "white") +
  geom_text(aes(label = Freq), color = "black", size = 5) +
  scale_fill_gradient(low = "lightblue", high = "darkblue") +
  labs(title = "k-NN Confusion Matrix (k = 5)", x = "Actual Class", y = "Predicted Class") +
  theme_minimal()

Confusion Matrix: Lower accuracy compared to other models. Several off-diagonal errors due to sensitivity to outliers and class overlap.

PCA Plot:

# PCA on combined data to get consistent transformation
combined <- rbind(knn_train, knn_test)
pca <- prcomp(combined, scale. = TRUE)

# Extract PCA scores for test set
pca_test <- as.data.frame(pca$x[(nrow(knn_train)+1):nrow(pca$x), 1:2])
pca_test$TrueClass <- test_label
pca_test$PredictedClass <- knn_pred

# Plot PCA results (explicit shapes so all 7 true classes are drawn)
ggplot(pca_test, aes(x = PC1, y = PC2, color = PredictedClass, shape = TrueClass)) +
  geom_point(size = 3, alpha = 0.7) +
  scale_shape_manual(values = 0:6) +
  labs(title = "PCA Projection of Test Data (k-NN predictions)",
       x = "Principal Component 1", y = "Principal Component 2") +
  theme_minimal() +
  scale_color_brewer(palette = "Dark2") +
  guides(color = guide_legend(title = "Predicted"), shape = guide_legend(title = "True"))

PCA Plot: Considerable mixing of predicted and true labels in overlapping regions, reflecting local neighborhood misclassifications.


5. Results

# Compare models
results <- data.frame(
  Model = c("Logistic Regression", "Random Forest", "XGBoost", "kNN"),
  Accuracy = c(
    mean(log_pred == test_label),
    mean(rf_pred == test_label),
    mean(xgb_pred == as.integer(test_label)-1),
    mean(knn_pred == test_label)
  )
)
print(results)
##                 Model  Accuracy
## 1 Logistic Regression 0.6408228
## 2       Random Forest 0.8512658
## 3             XGBoost 0.8370253
## 4                 kNN 0.7452532

Explanation:

This final comparison table consolidates model performance: Random Forest (85.1%) and XGBoost (83.7%) performed best, kNN (74.5%) was intermediate, and multinomial logistic regression (64.1%) trailed. The kNN figure differs slightly from its earlier confusion-matrix accuracy (75.2%) because the classifier was re-run and knn() breaks ties at random.
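For the written report, the same comparison reads well as a chart; a small sketch built from the results data frame above:

ggplot(results, aes(x = reorder(Model, Accuracy), y = Accuracy)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Test-Set Accuracy by Model", x = NULL, y = "Accuracy") +
  theme_minimal()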

Final Remarks (Recommendations for Report)

  • Predictive Power:
    With Height, Weight, and BMI deliberately excluded before modeling, behavioral features such as FCVC and CH2O emerged as top predictors. Categorical variables like FAVC and family_history_with_overweight also showed strong associations in the chi-squared tests.

  • Model Evaluation:
    Test accuracies of roughly 64-85% are plausible for lifestyle features alone, but every model was evaluated on a single train/test split. Further validation using cross-validation (e.g., the unused trainControl object defined above) or an external dataset is recommended.

  • Limitations:

    1. Potential self-reporting bias in survey-based features.

    2. No hyperparameter tuning was performed (e.g., k for kNN, max depth and nrounds for XGBoost), so the reported accuracies are likely not optimal.

    3. No assessment of generalizability on unseen data (e.g., cross-dataset evaluation).

  • Future Work:

    1. Introduce regularization (e.g., LASSO, Ridge) for model interpretability; a sketch follows this list.

    2. Explore dimensionality reduction (e.g., PCA) as a preprocessing step.

    3. Use SHAP values or LIME for explainable AI insights.
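
A sketch of future-work item 1, assuming the glmnet package (not used elsewhere in this analysis): a cross-validated multinomial LASSO on the one-hot training matrix.

library(glmnet)
x_train <- model.matrix(~ . - 1, data = train_data)
lasso_fit <- cv.glmnet(x_train, train_label, family = "multinomial", alpha = 1)
# Coefficients at the CV-optimal penalty; features shrunk to zero are dropped
coef(lasso_fit, s = "lambda.min")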