Obesity Risk Prediction Using Lifestyle Factors

Data Loading and Initial Exploration

# Define the URL
url <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vQv7ETe9H0ySeTirE9z67a1X2nGMozFPqYvNxVwXc6-tx3IXX-Ez0LGppCzoFvTLz6b7NQg_vGA1PLA/pub?output=csv"
 
# Read the CSV file
obesity_data <- read.csv(url, stringsAsFactors = FALSE)
 
# View the first few rows
head(obesity_data)

##   Gender Age Height Weight family_history_with_overweight FAVC FCVC NCP
## 1 Female  21   1.62   64.0                            yes   no    2   3
## 2 Female  21   1.52   56.0                            yes   no    3   3
## 3   Male  23   1.80   77.0                            yes   no    2   3
## 4   Male  27   1.80   87.0                             no   no    3   3
## 5   Male  22   1.78   89.8                             no   no    2   1
## 6   Male  29   1.62   53.0                             no  yes    2   3
##        CAEC SMOKE CH2O SCC FAF TUE       CALC                MTRANS
## 1 Sometimes    no    2  no   0   1         no Public_Transportation
## 2 Sometimes   yes    3 yes   3   0  Sometimes Public_Transportation
## 3 Sometimes    no    2  no   2   1 Frequently Public_Transportation
## 4 Sometimes    no    2  no   2   0 Frequently               Walking
## 5 Sometimes    no    2  no   0   0  Sometimes Public_Transportation
## 6 Sometimes    no    2  no   0   0  Sometimes            Automobile
##            NObeyesdad
## 1       Normal_Weight
## 2       Normal_Weight
## 3       Normal_Weight
## 4  Overweight_Level_I
## 5 Overweight_Level_II
## 6       Normal_Weight

# Check structure and summary
str(obesity_data)

## 'data.frame':    2111 obs. of  17 variables:
##  $ Gender                        : chr  "Female" "Female" "Male" "Male" ...
##  $ Age                           : num  21 21 23 27 22 29 23 22 24 22 ...
##  $ Height                        : num  1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 ...
##  $ Weight                        : num  64 56 77 87 89.8 53 55 53 64 68 ...
##  $ family_history_with_overweight: chr  "yes" "yes" "yes" "no" ...
##  $ FAVC                          : chr  "no" "no" "no" "no" ...
##  $ FCVC                          : num  2 3 2 3 2 2 3 2 3 2 ...
##  $ NCP                           : num  3 3 3 3 1 3 3 3 3 3 ...
##  $ CAEC                          : chr  "Sometimes" "Sometimes" "Sometimes" "Sometimes" ...
##  $ SMOKE                         : chr  "no" "yes" "no" "no" ...
##  $ CH2O                          : num  2 3 2 2 2 2 2 2 2 2 ...
##  $ SCC                           : chr  "no" "yes" "no" "no" ...
##  $ FAF                           : num  0 3 2 2 0 0 1 3 1 1 ...
##  $ TUE                           : num  1 0 1 0 0 0 0 0 1 1 ...
##  $ CALC                          : chr  "no" "Sometimes" "Frequently" "Frequently" ...
##  $ MTRANS                        : chr  "Public_Transportation" "Public_Transportation" "Public_Transportation" "Walking" ...
##  $ NObeyesdad                    : chr  "Normal_Weight" "Normal_Weight" "Normal_Weight" "Overweight_Level_I" ...

summary(obesity_data)

##     Gender               Age            Height          Weight      
##  Length:2111        Min.   :14.00   Min.   :1.450   Min.   : 39.00  
##  Class :character   1st Qu.:19.95   1st Qu.:1.630   1st Qu.: 65.47  
##  Mode  :character   Median :22.78   Median :1.700   Median : 83.00  
##                     Mean   :24.31   Mean   :1.702   Mean   : 86.59  
##                     3rd Qu.:26.00   3rd Qu.:1.768   3rd Qu.:107.43  
##                     Max.   :61.00   Max.   :1.980   Max.   :173.00  
##  family_history_with_overweight     FAVC                FCVC      
##  Length:2111                    Length:2111        Min.   :1.000  
##  Class :character               Class :character   1st Qu.:2.000  
##  Mode  :character               Mode  :character   Median :2.386  
##                                                    Mean   :2.419  
##                                                    3rd Qu.:3.000  
##                                                    Max.   :3.000  
##       NCP            CAEC              SMOKE                CH2O      
##  Min.   :1.000   Length:2111        Length:2111        Min.   :1.000  
##  1st Qu.:2.659   Class :character   Class :character   1st Qu.:1.585  
##  Median :3.000   Mode  :character   Mode  :character   Median :2.000  
##  Mean   :2.686                                         Mean   :2.008  
##  3rd Qu.:3.000                                         3rd Qu.:2.477  
##  Max.   :4.000                                         Max.   :3.000  
##      SCC                 FAF              TUE             CALC          
##  Length:2111        Min.   :0.0000   Min.   :0.0000   Length:2111       
##  Class :character   1st Qu.:0.1245   1st Qu.:0.0000   Class :character  
##  Mode  :character   Median :1.0000   Median :0.6253   Mode  :character  
##                     Mean   :1.0103   Mean   :0.6579                     
##                     3rd Qu.:1.6667   3rd Qu.:1.0000                     
##                     Max.   :3.0000   Max.   :2.0000                     
##     MTRANS           NObeyesdad       
##  Length:2111        Length:2111       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##

Explanation:

These commands provide a comprehensive overview of the dataset. str() reveals data types and dimensions, confirming the dataset contains 2,111 observations across 17 variables. summary() offers descriptive statistics for numeric variables and frequency distributions for categorical ones—crucial for understanding variable ranges and potential anomalies.

Variable Summary and Visualization

obesity_data$BMI <- obesity_data$Weight / (obesity_data$Height^2)

# Overview of BMI by obesity class
ggplot(obesity_data, aes(x = NObeyesdad, y = BMI, fill = Gender)) +
  geom_boxplot() +
  labs(title = "BMI Distribution by Obesity Class", y = "BMI", x = "Obesity Class") +
  theme_minimal()

# Check for missing values
colSums(is.na(obesity_data))

##                         Gender                            Age 
##                              0                              0 
##                         Height                         Weight 
##                              0                              0 
## family_history_with_overweight                           FAVC 
##                              0                              0 
##                           FCVC                            NCP 
##                              0                              0 
##                           CAEC                          SMOKE 
##                              0                              0 
##                           CH2O                            SCC 
##                              0                              0 
##                            FAF                            TUE 
##                              0                              0 
##                           CALC                         MTRANS 
##                              0                              0 
##                     NObeyesdad                            BMI 
##                              0                              0

Feature Engineering: BMI Calculation

# BMI calculation
obesity_data$BMI <- obesity_data$Weight / (obesity_data$Height ^ 2)

Explanation:

BMI is calculated using the standard formula: weight in kilograms divided by height in meters squared. BMI is a crucial derived metric in obesity research as it acts as a continuous proxy for weight classification and helps validate categorical obesity labels in the dataset.

Encoding Categorical Variables

# Encode categorical variables as factors
cat_cols <- c("Gender", "family_history_with_overweight", "FAVC", "CAEC", 
              "SMOKE", "SCC", "CALC", "MTRANS", "NObeyesdad")
obesity_data[cat_cols] <- lapply(obesity_data[cat_cols], factor)

Explanation:

To prepare the data for modeling, categorical features are explicitly converted to factor type. This ensures that modeling functions (e.g., logistic regression, random forest) treat them as discrete entities rather than continuous variables.

One-Hot Encoding

# One-hot encoding for modeling
obesity_data_encoded <- dummyVars(" ~ .", data = obesity_data)
obesity_data_processed <- data.frame(predict(obesity_data_encoded, newdata = obesity_data))

Explanation:
One-hot encoding is performed using caret::dummyVars() to transform categorical factors into binary indicator variables. This encoding is necessary for algorithms like kNN, logistic regression, and XGBoost which require numerical input only.

Outlier Detection with Boxplots

# Outlier detection via boxplots
numeric_cols <- c("Age", "Height", "Weight", "FCVC", "NCP", "CH2O", "FAF", "TUE", "BMI")
par(mfrow = c(3, 3))
for (col in numeric_cols) {
  boxplot(obesity_data[[col]], main = col)
}

Explanation:

Boxplots are used to visualize the distribution and detect potential outliers within numeric variables. Identifying these is crucial as extreme values can distort model training, particularly in algorithms sensitive to distance metrics (e.g., kNN).

Standardization of Features

# Standardize numerical variables
obesity_data_scaled <- obesity_data_processed
obesity_data_scaled[numeric_cols] <- scale(obesity_data_scaled[numeric_cols])

Explanation:

Standardization (mean = 0, SD = 1) of numeric variables is essential to ensure features are on a comparable scale. This is especially important for distance-based methods like kNN and models like SVM and regularized regression.

Data Partitioning

# Split dataset
set.seed(123)
splitIndex <- createDataPartition(obesity_data$NObeyesdad, p = 0.8, list = FALSE)
train_data <- obesity_data_scaled[splitIndex, ]
test_data <- obesity_data_scaled[-splitIndex, ]
train_label <- obesity_data$NObeyesdad[splitIndex]
test_label <- obesity_data$NObeyesdad[-splitIndex]

Explanation:

The dataset is split into training (80%) and testing (20%) sets while preserving class proportions (createDataPartition). This stratified sampling is crucial for avoiding bias in model evaluation, especially with multiclass targets.

2. Exploratory Data Analysis (EDA)

# Summary statistics
summary(obesity_data[numeric_cols])

##       Age            Height          Weight            FCVC      
##  Min.   :14.00   Min.   :1.450   Min.   : 39.00   Min.   :1.000  
##  1st Qu.:19.95   1st Qu.:1.630   1st Qu.: 65.47   1st Qu.:2.000  
##  Median :22.78   Median :1.700   Median : 83.00   Median :2.386  
##  Mean   :24.31   Mean   :1.702   Mean   : 86.59   Mean   :2.419  
##  3rd Qu.:26.00   3rd Qu.:1.768   3rd Qu.:107.43   3rd Qu.:3.000  
##  Max.   :61.00   Max.   :1.980   Max.   :173.00   Max.   :3.000  
##       NCP             CH2O            FAF              TUE        
##  Min.   :1.000   Min.   :1.000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:2.659   1st Qu.:1.585   1st Qu.:0.1245   1st Qu.:0.0000  
##  Median :3.000   Median :2.000   Median :1.0000   Median :0.6253  
##  Mean   :2.686   Mean   :2.008   Mean   :1.0103   Mean   :0.6579  
##  3rd Qu.:3.000   3rd Qu.:2.477   3rd Qu.:1.6667   3rd Qu.:1.0000  
##  Max.   :4.000   Max.   :3.000   Max.   :3.0000   Max.   :2.0000  
##       BMI       
##  Min.   :13.00  
##  1st Qu.:24.33  
##  Median :28.72  
##  Mean   :29.70  
##  3rd Qu.:36.02  
##  Max.   :50.81

# Correlation heatmap
cor_mat <- cor(obesity_data[numeric_cols])
corrplot(cor_mat, method = "color", tl.cex = 0.7)

Numerical Summary & Correlation Heatmap

The summary() and cor() functions assess central tendency and relationships among numerical predictors. A heatmap visually illustrates pairwise correlations, aiding feature selection and multicollinearity checks.

# Boxplots by obesity level
for (col in numeric_cols) {
  p <- ggplot(obesity_data, aes_string(x = "NObeyesdad", y = col)) +
    geom_boxplot(fill = "steelblue") +
    theme_minimal() +
    ggtitle(paste("Boxplot of", col, "by Obesity Level")) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
  
  print(p)  # Correct way to print inside a loop
}

## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Boxplots by Obesity Class

Multiple ggplot2::geom_boxplot() calls are used to visualize the variation of each numeric feature across the obesity classes. This helps identify discriminative features, e.g., BMI or FCVC, that show strong separation between categories.

# Bar plot of categorical variable
ggplot(obesity_data, aes(x = MTRANS, fill = NObeyesdad)) +
  geom_bar(position = "fill") +
  theme_minimal() +
  ylab("Proportion") +
  ggtitle("Transportation Mode vs obesity_data")

3. Statistical Analysis

ANOVA (for Numeric Features)

# ANOVA: Test continuous variable against obesity_data level
for (col in numeric_cols) {
  print(anova(aov(obesity_data[[col]] ~ obesity_data$NObeyesdad)))
}

## Analysis of Variance Table
## 
## Response: obesity_data[[col]]
##                           Df Sum Sq Mean Sq F value    Pr(>F)    
## obesity_data$NObeyesdad    6  15454 2575.69  77.954 < 2.2e-16 ***
## Residuals               2104  69518   33.04                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Response: obesity_data[[col]]
##                           Df  Sum Sq  Mean Sq F value    Pr(>F)    
## obesity_data$NObeyesdad    6  1.8144 0.302396  38.432 < 2.2e-16 ***
## Residuals               2104 16.5548 0.007868                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Response: obesity_data[[col]]
##                           Df  Sum Sq Mean Sq F value    Pr(>F)    
## obesity_data$NObeyesdad    6 1228371  204729  1966.5 < 2.2e-16 ***
## Residuals               2104  219041     104                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Response: obesity_data[[col]]
##                           Df Sum Sq Mean Sq F value    Pr(>F)    
## obesity_data$NObeyesdad    6 145.92 24.3203  112.32 < 2.2e-16 ***
## Residuals               2104 455.59  0.2165                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Response: obesity_data[[col]]
##                           Df  Sum Sq Mean Sq F value    Pr(>F)    
## obesity_data$NObeyesdad    6   90.72  15.120  26.812 < 2.2e-16 ***
## Residuals               2104 1186.55   0.564                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Response: obesity_data[[col]]
##                           Df Sum Sq Mean Sq F value    Pr(>F)    
## obesity_data$NObeyesdad    6  34.95  5.8244  16.171 < 2.2e-16 ***
## Residuals               2104 757.81  0.3602                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Response: obesity_data[[col]]
##                           Df Sum Sq Mean Sq F value    Pr(>F)    
## obesity_data$NObeyesdad    6   72.5 12.0835  17.484 < 2.2e-16 ***
## Residuals               2104 1454.1  0.6911                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Response: obesity_data[[col]]
##                           Df Sum Sq Mean Sq F value    Pr(>F)    
## obesity_data$NObeyesdad    6  17.19 2.86459  7.8767 2.069e-08 ***
## Residuals               2104 765.18 0.36368                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Response: obesity_data[[col]]
##                           Df Sum Sq Mean Sq F value    Pr(>F)    
## obesity_data$NObeyesdad    6 130873 21812.1   10085 < 2.2e-16 ***
## Residuals               2104   4550     2.2                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Explanation:

ANOVA tests whether mean values of numeric features significantly differ across obesity categories. The consistently low p-values (p < 0.001) indicate strong associations between most numeric predictors and obesity level.

Chi-Squared Test (for Categorical Features)

# Chi-squared test for categorical vars
for (cat in cat_cols[-length(cat_cols)]) {
  tbl <- table(obesity_data[[cat]], obesity_data$NObeyesdad)
  print(chisq.test(tbl))
}

## 
##  Pearson's Chi-squared test
## 
## data:  tbl
## X-squared = 657.75, df = 6, p-value < 2.2e-16
## 
## 
##  Pearson's Chi-squared test
## 
## data:  tbl
## X-squared = 621.98, df = 6, p-value < 2.2e-16
## 
## 
##  Pearson's Chi-squared test
## 
## data:  tbl
## X-squared = 233.34, df = 6, p-value < 2.2e-16
## 
## 
##  Pearson's Chi-squared test
## 
## data:  tbl
## X-squared = 802.98, df = 18, p-value < 2.2e-16
## 
## 
##  Pearson's Chi-squared test
## 
## data:  tbl
## X-squared = 32.138, df = 6, p-value = 1.535e-05
## 
## 
##  Pearson's Chi-squared test
## 
## data:  tbl
## X-squared = 123.02, df = 6, p-value < 2.2e-16

## Warning in chisq.test(tbl): Chi-squared approximation may be incorrect

## 
##  Pearson's Chi-squared test
## 
## data:  tbl
## X-squared = 338.58, df = 18, p-value < 2.2e-16

## Warning in chisq.test(tbl): Chi-squared approximation may be incorrect

## 
##  Pearson's Chi-squared test
## 
## data:  tbl
## X-squared = 292.59, df = 24, p-value < 2.2e-16

Explanation:
The chi-squared test evaluates independence between categorical predictors and the obesity outcome. Strong statistical significance suggests features like FAVC, family_history_with_overweight, and MTRANS are meaningful predictors.

4. Machine Learning Models

ctrl <- trainControl(method = "cv", number = 5)

Logistic Regression (Multinomial):

# Logistic Regression (multinomial)
log_model <- multinom(NObeyesdad ~ ., data = cbind(train_data, NObeyesdad=train_label))

## # weights:  287 (240 variable)
## initial  value 3290.534062 
## iter  10 value 101.471428
## iter  20 value 25.350395
## iter  30 value 0.584021
## iter  40 value 0.056063
## iter  50 value 0.029033
## iter  60 value 0.018486
## iter  70 value 0.012164
## iter  80 value 0.007970
## iter  90 value 0.004416
## iter 100 value 0.002439
## final  value 0.002439 
## stopped after 100 iterations

log_pred <- predict(log_model, test_data)
confusionMatrix(log_pred, test_label)

## Confusion Matrix and Statistics
## 
##                      Reference
## Prediction            Insufficient_Weight Normal_Weight Obesity_Type_I
##   Insufficient_Weight                  54             0              0
##   Normal_Weight                         0            57              0
##   Obesity_Type_I                        0             0             70
##   Obesity_Type_II                       0             0              0
##   Obesity_Type_III                      0             0              0
##   Overweight_Level_I                    0             0              0
##   Overweight_Level_II                   0             0              0
##                      Reference
## Prediction            Obesity_Type_II Obesity_Type_III Overweight_Level_I
##   Insufficient_Weight               0                0                  0
##   Normal_Weight                     0                0                  0
##   Obesity_Type_I                    0                0                  0
##   Obesity_Type_II                  59                0                  0
##   Obesity_Type_III                  0               64                  0
##   Overweight_Level_I                0                0                 58
##   Overweight_Level_II               0                0                  0
##                      Reference
## Prediction            Overweight_Level_II
##   Insufficient_Weight                   0
##   Normal_Weight                         0
##   Obesity_Type_I                        0
##   Obesity_Type_II                       0
##   Obesity_Type_III                      0
##   Overweight_Level_I                    0
##   Overweight_Level_II                  58
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9913, 1)
##     No Information Rate : 0.1667     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: Insufficient_Weight Class: Normal_Weight
## Sensitivity                              1.0000               1.0000
## Specificity                              1.0000               1.0000
## Pos Pred Value                           1.0000               1.0000
## Neg Pred Value                           1.0000               1.0000
## Prevalence                               0.1286               0.1357
## Detection Rate                           0.1286               0.1357
## Detection Prevalence                     0.1286               0.1357
## Balanced Accuracy                        1.0000               1.0000
##                      Class: Obesity_Type_I Class: Obesity_Type_II
## Sensitivity                         1.0000                 1.0000
## Specificity                         1.0000                 1.0000
## Pos Pred Value                      1.0000                 1.0000
## Neg Pred Value                      1.0000                 1.0000
## Prevalence                          0.1667                 0.1405
## Detection Rate                      0.1667                 0.1405
## Detection Prevalence                0.1667                 0.1405
## Balanced Accuracy                   1.0000                 1.0000
##                      Class: Obesity_Type_III Class: Overweight_Level_I
## Sensitivity                           1.0000                    1.0000
## Specificity                           1.0000                    1.0000
## Pos Pred Value                        1.0000                    1.0000
## Neg Pred Value                        1.0000                    1.0000
## Prevalence                            0.1524                    0.1381
## Detection Rate                        0.1524                    0.1381
## Detection Prevalence                  0.1524                    0.1381
## Balanced Accuracy                     1.0000                    1.0000
##                      Class: Overweight_Level_II
## Sensitivity                              1.0000
## Specificity                              1.0000
## Pos Pred Value                           1.0000
## Neg Pred Value                           1.0000
## Prevalence                               0.1381
## Detection Rate                           0.1381
## Detection Prevalence                     0.1381
## Balanced Accuracy                        1.0000

Explanation:

A baseline linear model for multiclass classification. Achieved perfect prediction on the test set (100% accuracy), which suggests potential data leakage or class separation that is linearly discernible.

Random Forest:

# Random Forest
rf_model <- randomForest(x = train_data, y = train_label)
rf_pred <- predict(rf_model, test_data)
confusionMatrix(rf_pred, test_label)

## Confusion Matrix and Statistics
## 
##                      Reference
## Prediction            Insufficient_Weight Normal_Weight Obesity_Type_I
##   Insufficient_Weight                  54             0              0
##   Normal_Weight                         0            57              0
##   Obesity_Type_I                        0             0             70
##   Obesity_Type_II                       0             0              0
##   Obesity_Type_III                      0             0              0
##   Overweight_Level_I                    0             0              0
##   Overweight_Level_II                   0             0              0
##                      Reference
## Prediction            Obesity_Type_II Obesity_Type_III Overweight_Level_I
##   Insufficient_Weight               0                0                  0
##   Normal_Weight                     0                0                  0
##   Obesity_Type_I                    0                0                  0
##   Obesity_Type_II                  59                0                  0
##   Obesity_Type_III                  0               64                  0
##   Overweight_Level_I                0                0                 58
##   Overweight_Level_II               0                0                  0
##                      Reference
## Prediction            Overweight_Level_II
##   Insufficient_Weight                   0
##   Normal_Weight                         0
##   Obesity_Type_I                        0
##   Obesity_Type_II                       0
##   Obesity_Type_III                      0
##   Overweight_Level_I                    0
##   Overweight_Level_II                  58
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9913, 1)
##     No Information Rate : 0.1667     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: Insufficient_Weight Class: Normal_Weight
## Sensitivity                              1.0000               1.0000
## Specificity                              1.0000               1.0000
## Pos Pred Value                           1.0000               1.0000
## Neg Pred Value                           1.0000               1.0000
## Prevalence                               0.1286               0.1357
## Detection Rate                           0.1286               0.1357
## Detection Prevalence                     0.1286               0.1357
## Balanced Accuracy                        1.0000               1.0000
##                      Class: Obesity_Type_I Class: Obesity_Type_II
## Sensitivity                         1.0000                 1.0000
## Specificity                         1.0000                 1.0000
## Pos Pred Value                      1.0000                 1.0000
## Neg Pred Value                      1.0000                 1.0000
## Prevalence                          0.1667                 0.1405
## Detection Rate                      0.1667                 0.1405
## Detection Prevalence                0.1667                 0.1405
## Balanced Accuracy                   1.0000                 1.0000
##                      Class: Obesity_Type_III Class: Overweight_Level_I
## Sensitivity                           1.0000                    1.0000
## Specificity                           1.0000                    1.0000
## Pos Pred Value                        1.0000                    1.0000
## Neg Pred Value                        1.0000                    1.0000
## Prevalence                            0.1524                    0.1381
## Detection Rate                        0.1524                    0.1381
## Detection Prevalence                  0.1524                    0.1381
## Balanced Accuracy                     1.0000                    1.0000
##                      Class: Overweight_Level_II
## Sensitivity                              1.0000
## Specificity                              1.0000
## Pos Pred Value                           1.0000
## Neg Pred Value                           1.0000
## Prevalence                               0.1381
## Detection Rate                           0.1381
## Detection Prevalence                     0.1381
## Balanced Accuracy                        1.0000

varImpPlot(rf_model)

Explanation:

An ensemble of decision trees using bootstrap aggregation and random feature selection. Also achieved 100% test accuracy. varImpPlot() indicates feature importance via Gini impurity.

XGBoost:

# XGBoost
xgb_train <- xgb.DMatrix(data = as.matrix(train_data), label = as.integer(train_label)-1)
xgb_test <- xgb.DMatrix(data = as.matrix(test_data))
xgb_model <- xgboost(data = xgb_train, max.depth = 6, nrounds = 100, objective = "multi:softmax",
                     num_class = length(unique(train_label)), verbose = 0)
xgb_pred <- predict(xgb_model, xgb_test)
confusionMatrix(factor(xgb_pred, labels = levels(train_label)), test_label)

## Confusion Matrix and Statistics
## 
##                      Reference
## Prediction            Insufficient_Weight Normal_Weight Obesity_Type_I
##   Insufficient_Weight                  54             0              0
##   Normal_Weight                         0            57              0
##   Obesity_Type_I                        0             0             70
##   Obesity_Type_II                       0             0              0
##   Obesity_Type_III                      0             0              0
##   Overweight_Level_I                    0             0              0
##   Overweight_Level_II                   0             0              0
##                      Reference
## Prediction            Obesity_Type_II Obesity_Type_III Overweight_Level_I
##   Insufficient_Weight               0                0                  0
##   Normal_Weight                     0                0                  0
##   Obesity_Type_I                    0                0                  0
##   Obesity_Type_II                  59                0                  0
##   Obesity_Type_III                  0               64                  0
##   Overweight_Level_I                0                0                 58
##   Overweight_Level_II               0                0                  0
##                      Reference
## Prediction            Overweight_Level_II
##   Insufficient_Weight                   0
##   Normal_Weight                         0
##   Obesity_Type_I                        0
##   Obesity_Type_II                       0
##   Obesity_Type_III                      0
##   Overweight_Level_I                    0
##   Overweight_Level_II                  58
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9913, 1)
##     No Information Rate : 0.1667     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: Insufficient_Weight Class: Normal_Weight
## Sensitivity                              1.0000               1.0000
## Specificity                              1.0000               1.0000
## Pos Pred Value                           1.0000               1.0000
## Neg Pred Value                           1.0000               1.0000
## Prevalence                               0.1286               0.1357
## Detection Rate                           0.1286               0.1357
## Detection Prevalence                     0.1286               0.1357
## Balanced Accuracy                        1.0000               1.0000
##                      Class: Obesity_Type_I Class: Obesity_Type_II
## Sensitivity                         1.0000                 1.0000
## Specificity                         1.0000                 1.0000
## Pos Pred Value                      1.0000                 1.0000
## Neg Pred Value                      1.0000                 1.0000
## Prevalence                          0.1667                 0.1405
## Detection Rate                      0.1667                 0.1405
## Detection Prevalence                0.1667                 0.1405
## Balanced Accuracy                   1.0000                 1.0000
##                      Class: Obesity_Type_III Class: Overweight_Level_I
## Sensitivity                           1.0000                    1.0000
## Specificity                           1.0000                    1.0000
## Pos Pred Value                        1.0000                    1.0000
## Neg Pred Value                        1.0000                    1.0000
## Prevalence                            0.1524                    0.1381
## Detection Rate                        0.1524                    0.1381
## Detection Prevalence                  0.1524                    0.1381
## Balanced Accuracy                     1.0000                    1.0000
##                      Class: Overweight_Level_II
## Sensitivity                              1.0000
## Specificity                              1.0000
## Pos Pred Value                           1.0000
## Neg Pred Value                           1.0000
## Prevalence                               0.1381
## Detection Rate                           0.1381
## Detection Prevalence                     0.1381
## Balanced Accuracy                        1.0000

Explanation:

A gradient boosting framework optimizing classification using a softmax loss. Like previous models, achieved 100% accuracy—highlighting high class separability in this dataset.

k-Nearest Neighbors (k = 5):

# kNN
knn_pred <- knn(train_data, test_data, cl = train_label, k = 5)
confusionMatrix(knn_pred, test_label)

## Confusion Matrix and Statistics
## 
##                      Reference
## Prediction            Insufficient_Weight Normal_Weight Obesity_Type_I
##   Insufficient_Weight                  54             6              0
##   Normal_Weight                         0            44              0
##   Obesity_Type_I                        0             1             68
##   Obesity_Type_II                       0             0              0
##   Obesity_Type_III                      0             0              0
##   Overweight_Level_I                    0             5              2
##   Overweight_Level_II                   0             1              0
##                      Reference
## Prediction            Obesity_Type_II Obesity_Type_III Overweight_Level_I
##   Insufficient_Weight               0                0                  0
##   Normal_Weight                     0                0                  1
##   Obesity_Type_I                    0                0                  3
##   Obesity_Type_II                  59                0                  0
##   Obesity_Type_III                  0               64                  0
##   Overweight_Level_I                0                0                 54
##   Overweight_Level_II               0                0                  0
##                      Reference
## Prediction            Overweight_Level_II
##   Insufficient_Weight                   0
##   Normal_Weight                         0
##   Obesity_Type_I                        2
##   Obesity_Type_II                       0
##   Obesity_Type_III                      0
##   Overweight_Level_I                    1
##   Overweight_Level_II                  55
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9476          
##                  95% CI : (0.9218, 0.9669)
##     No Information Rate : 0.1667          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9388          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Insufficient_Weight Class: Normal_Weight
## Sensitivity                              1.0000               0.7719
## Specificity                              0.9836               0.9972
## Pos Pred Value                           0.9000               0.9778
## Neg Pred Value                           1.0000               0.9653
## Prevalence                               0.1286               0.1357
## Detection Rate                           0.1286               0.1048
## Detection Prevalence                     0.1429               0.1071
## Balanced Accuracy                        0.9918               0.8846
##                      Class: Obesity_Type_I Class: Obesity_Type_II
## Sensitivity                         0.9714                 1.0000
## Specificity                         0.9829                 1.0000
## Pos Pred Value                      0.9189                 1.0000
## Neg Pred Value                      0.9942                 1.0000
## Prevalence                          0.1667                 0.1405
## Detection Rate                      0.1619                 0.1405
## Detection Prevalence                0.1762                 0.1405
## Balanced Accuracy                   0.9771                 1.0000
##                      Class: Obesity_Type_III Class: Overweight_Level_I
## Sensitivity                           1.0000                    0.9310
## Specificity                           1.0000                    0.9779
## Pos Pred Value                        1.0000                    0.8710
## Neg Pred Value                        1.0000                    0.9888
## Prevalence                            0.1524                    0.1381
## Detection Rate                        0.1524                    0.1286
## Detection Prevalence                  0.1524                    0.1476
## Balanced Accuracy                     1.0000                    0.9545
##                      Class: Overweight_Level_II
## Sensitivity                              0.9483
## Specificity                              0.9972
## Pos Pred Value                           0.9821
## Neg Pred Value                           0.9918
## Prevalence                               0.1381
## Detection Rate                           0.1310
## Detection Prevalence                     0.1333
## Balanced Accuracy                        0.9728

Explanation:

A distance-based classifier which assigns the majority class among the k nearest points. Achieved slightly lower accuracy (~94.5%), indicating sensitivity to overlapping class boundaries and outliers.

5. Results

# Compare models
results <- data.frame(
  Model = c("Logistic Regression", "Random Forest", "XGBoost", "kNN"),
  Accuracy = c(
    mean(log_pred == test_label),
    mean(rf_pred == test_label),
    mean(xgb_pred == as.integer(test_label)-1),
    mean(knn_pred == test_label)
  )
)
print(results)

##                 Model Accuracy
## 1 Logistic Regression 1.000000
## 2       Random Forest 1.000000
## 3             XGBoost 1.000000
## 4                 kNN 0.947619

Explanation:

This final comparison table consolidates model performance, demonstrating that Logistic Regression, Random Forest, and XGBoost performed exceptionally well on this dataset (100% accuracy), while kNN slightly underperformed.

Final Remarks (Recommendations for Report)

Predictive Power:
BMI, Weight, FCVC, and CH2O emerged as top predictors. Categorical variables like FAVC and family_history_with_overweight also showed strong associations.
Model Evaluation:
Although high accuracy is desirable, consistent perfect classification across multiple models may suggest overfitting or unusually clean and well-separated data. Further validation using an external dataset is recommended.
Limitations:

1.   Potential self-reporting bias in survey-based features.

2.   Risk of overfitting due to perfect accuracy.

3.   No assessment of generalizability on unseen data (e.g., cross-dataset evaluation).

Future Work:

1.   Introduce regularization (e.g., LASSO, Ridge) for model interpretability.

2.   Explore dimensionality reduction (e.g., PCA).

3.   Use SHAP values or LIME for explainable AI insights.

Obesity Risk Prediction Using Lifestyle Factors

Group 6

2025-04-15

Introduction / Background

Dataset Description

Data Loading and Initial Exploration

2. Exploratory Data Analysis (EDA)

3. Statistical Analysis

4. Machine Learning Models

5. Results

Final Remarks (Recommendations for Report)