The Dataset and Initial EDA

Author

Onesmus Kabui

About the Dataset

This dataset describes the eating habits and physical condition of individuals from Mexico, Peru and Colombia. By applying statistical and machine learning models, we can move beyond description and actually predict obesity levels. Each entry contains 17 variables and we have 2111 entries in total. 23% of the data was collected through a web platform directly from users and the rest was generated synthetically. The target variable classifies each individual into one of seven categories: Insufficient Weight, Normal Weight, Overweight Level I, Overweight Level II, Obesity Type I, Obesity Type II or Obesity Type III.

Variables Overview

This dataset has 17 variables: 16 features and one target variable. The variables can be grouped into the following categories.

1. Demographic Information

Gender
Age – Age in years
Height – Height in meters
Weight – Weight in kilograms

2. Eating Habits
family_history_with_overweight – Whether close family members are overweight
FAVC – Frequent consumption of high-calorie food (yes/no)
FCVC – Frequency of vegetable consumption (scale)
NCP – Number of main meals per day
CAEC – Consumption of food between meals (snacking)
SMOKE – Whether the individual smokes
CH2O – Daily water intake (liters)
SCC – Monitoring of daily calorie consumption (yes/no)
CALC – Frequency of alcohol consumption

3. Physical Condition & Lifestyle
FAF – Physical activity frequency (hours per week)
TUE – Time spent using technology devices (hours per day)
MTRANS – Mode of transportation used (car, bike, public transport, walking, etc.)

4. Target Variable
NObeyesdad – Obesity Level (Insufficient Weight, Normal Weight, Overweight I, Overweight II, Obesity Type I, Obesity Type II, Obesity Type III)

Initial EDA

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(broom)

Warning: package 'broom' was built under R version 4.4.3

setwd("C:/Users/user/Desktop/datasets")
obesity_data<-read.csv("ObesityDataSet_raw.csv")
glimpse(obesity_data)

Rows: 2,111
Columns: 17
$ Gender                         <chr> "Female", "Female", "Male", "Male", "Ma…
$ Age                            <dbl> 21, 21, 23, 27, 22, 29, 23, 22, 24, 22,…
$ Height                         <dbl> 1.62, 1.52, 1.80, 1.80, 1.78, 1.62, 1.5…
$ Weight                         <dbl> 64.0, 56.0, 77.0, 87.0, 89.8, 53.0, 55.…
$ family_history_with_overweight <chr> "yes", "yes", "yes", "no", "no", "no", …
$ FAVC                           <chr> "no", "no", "no", "no", "no", "yes", "y…
$ FCVC                           <dbl> 2, 3, 2, 3, 2, 2, 3, 2, 3, 2, 3, 2, 3, …
$ NCP                            <dbl> 3, 3, 3, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3, …
$ CAEC                           <chr> "Sometimes", "Sometimes", "Sometimes", …
$ SMOKE                          <chr> "no", "yes", "no", "no", "no", "no", "n…
$ CH2O                           <dbl> 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 3, …
$ SCC                            <chr> "no", "yes", "no", "no", "no", "no", "n…
$ FAF                            <dbl> 0, 3, 2, 2, 0, 0, 1, 3, 1, 1, 2, 2, 2, …
$ TUE                            <dbl> 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 2, 1, 0, …
$ CALC                           <chr> "no", "Sometimes", "Frequently", "Frequ…
$ MTRANS                         <chr> "Public_Transportation", "Public_Transp…
$ NObeyesdad                     <chr> "Normal_Weight", "Normal_Weight", "Norm…

summary(obesity_data)

    Gender               Age            Height          Weight      
 Length:2111        Min.   :14.00   Min.   :1.450   Min.   : 39.00  
 Class :character   1st Qu.:19.95   1st Qu.:1.630   1st Qu.: 65.47  
 Mode  :character   Median :22.78   Median :1.700   Median : 83.00  
                    Mean   :24.31   Mean   :1.702   Mean   : 86.59  
                    3rd Qu.:26.00   3rd Qu.:1.768   3rd Qu.:107.43  
                    Max.   :61.00   Max.   :1.980   Max.   :173.00  
 family_history_with_overweight     FAVC                FCVC      
 Length:2111                    Length:2111        Min.   :1.000  
 Class :character               Class :character   1st Qu.:2.000  
 Mode  :character               Mode  :character   Median :2.386  
                                                   Mean   :2.419  
                                                   3rd Qu.:3.000  
                                                   Max.   :3.000  
      NCP            CAEC              SMOKE                CH2O      
 Min.   :1.000   Length:2111        Length:2111        Min.   :1.000  
 1st Qu.:2.659   Class :character   Class :character   1st Qu.:1.585  
 Median :3.000   Mode  :character   Mode  :character   Median :2.000  
 Mean   :2.686                                         Mean   :2.008  
 3rd Qu.:3.000                                         3rd Qu.:2.477  
 Max.   :4.000                                         Max.   :3.000  
     SCC                 FAF              TUE             CALC          
 Length:2111        Min.   :0.0000   Min.   :0.0000   Length:2111       
 Class :character   1st Qu.:0.1245   1st Qu.:0.0000   Class :character  
 Mode  :character   Median :1.0000   Median :0.6253   Mode  :character  
                    Mean   :1.0103   Mean   :0.6579                     
                    3rd Qu.:1.6667   3rd Qu.:1.0000                     
                    Max.   :3.0000   Max.   :2.0000                     
    MTRANS           NObeyesdad       
 Length:2111        Length:2111       
 Class :character   Class :character  
 Mode  :character   Mode  :character

colSums(is.na(obesity_data))

                        Gender                            Age 
                             0                              0 
                        Height                         Weight 
                             0                              0 
family_history_with_overweight                           FAVC 
                             0                              0 
                          FCVC                            NCP 
                             0                              0 
                          CAEC                          SMOKE 
                             0                              0 
                          CH2O                            SCC 
                             0                              0 
                           FAF                            TUE 
                             0                              0 
                          CALC                         MTRANS 
                             0                              0 
                    NObeyesdad 
                             0

#Understanding the target variable level balance
obesity_data %>% 
  count(NObeyesdad) %>% 
  ggplot(aes(x=NObeyesdad,y= n, fill=NObeyesdad))+
  geom_col()+
  labs(title = "obesity levels balance",x="obesity level",y="count")+
  theme_minimal()+
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) #to stop overlapping for x axis labels

In the initial EDA we made some key observations about the obesity data. glimpse() shows that every column is stored either as double or as character; the character columns need to be converted to factors. The summary statistics show that all numeric variables fall within reasonable ranges.

There are no missing values in any variable. We also plot the target variable to check that the classes are balanced and that no class dominates. Obesity_Type_I has the highest count and Insufficient_Weight the lowest, but the differences between class counts are small, so we can conclude that the target variable is reasonably balanced.
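The balance claim can also be checked numerically rather than only by eye. The sketch below is a minimal illustration that uses the class counts from the frequency table reported later in this document; the names class_counts, class_share and imbalance_ratio are introduced here for illustration only.

```r
# Class counts copied from the frequency table reported in this document
class_counts <- c(
  Insufficient_Weight = 272, Normal_Weight = 287, Obesity_Type_I = 351,
  Obesity_Type_II = 297, Obesity_Type_III = 324,
  Overweight_Level_I = 290, Overweight_Level_II = 290
)

# Share of each class, in percent
class_share <- round(100 * class_counts / sum(class_counts), 1)
class_share

# Ratio of the largest to the smallest class; 1 would mean perfect balance
imbalance_ratio <- max(class_counts) / min(class_counts)
imbalance_ratio
```

A ratio close to 1 supports treating the classes as balanced; how far from 1 is still acceptable is a judgment call rather than a fixed rule.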

Data Cleaning and Preparation

Renaming Columns

obesity_data <- obesity_data %>%
  rename(gender = Gender, age = Age, height = Height, weight = Weight, high_cal_food = FAVC, vegetable_consumption = FCVC,
         smoke = SMOKE, main_meals_number = NCP, snacking_between_meals = CAEC, water_intake = CH2O, calories_monitoring = SCC,
         alcohol_frequency = CALC, physical_activity = FAF, screen_time = TUE, transport_means = MTRANS, obesity_level = NObeyesdad)

glimpse(obesity_data,2)

Rows: 2,111
Columns: 17
$ gender                         <chr> …
$ age                            <dbl> …
$ height                         <dbl> …
$ weight                         <dbl> …
$ family_history_with_overweight <chr> …
$ high_cal_food                  <chr> …
$ vegetable_consumption          <dbl> …
$ main_meals_number              <dbl> …
$ snacking_between_meals         <chr> …
$ smoke                          <chr> …
$ water_intake                   <dbl> …
$ calories_monitoring            <chr> …
$ physical_activity              <dbl> …
$ screen_time                    <dbl> …
$ alcohol_frequency              <chr> …
$ transport_means                <chr> …
$ obesity_level                  <chr> …

We standardized the variable names by replacing abbreviations and capital letters with descriptive lowercase names to improve readability and consistency.

Checking data types

str(obesity_data)

'data.frame':   2111 obs. of  17 variables:
 $ gender                        : chr  "Female" "Female" "Male" "Male" ...
 $ age                           : num  21 21 23 27 22 29 23 22 24 22 ...
 $ height                        : num  1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 ...
 $ weight                        : num  64 56 77 87 89.8 53 55 53 64 68 ...
 $ family_history_with_overweight: chr  "yes" "yes" "yes" "no" ...
 $ high_cal_food                 : chr  "no" "no" "no" "no" ...
 $ vegetable_consumption         : num  2 3 2 3 2 2 3 2 3 2 ...
 $ main_meals_number             : num  3 3 3 3 1 3 3 3 3 3 ...
 $ snacking_between_meals        : chr  "Sometimes" "Sometimes" "Sometimes" "Sometimes" ...
 $ smoke                         : chr  "no" "yes" "no" "no" ...
 $ water_intake                  : num  2 3 2 2 2 2 2 2 2 2 ...
 $ calories_monitoring           : chr  "no" "yes" "no" "no" ...
 $ physical_activity             : num  0 3 2 2 0 0 1 3 1 1 ...
 $ screen_time                   : num  1 0 1 0 0 0 0 0 1 1 ...
 $ alcohol_frequency             : chr  "no" "Sometimes" "Frequently" "Frequently" ...
 $ transport_means               : chr  "Public_Transportation" "Public_Transportation" "Public_Transportation" "Walking" ...
 $ obesity_level                 : chr  "Normal_Weight" "Normal_Weight" "Normal_Weight" "Overweight_Level_I" ...

All variables stored as character are really categorical, so we convert every character column to a factor.

obesity_data<-obesity_data %>%   
  mutate(across(where(is.character),as.factor)) 
library(purrr)
obesity_data %>%   
  select(where(is.factor)) %>%  
  map(table) # show the levels of each categorical variable and their counts

$gender

Female   Male 
  1043   1068 

$family_history_with_overweight

  no  yes 
 385 1726 

$high_cal_food

  no  yes 
 245 1866 

$snacking_between_meals

    Always Frequently         no  Sometimes 
        53        242         51       1765 

$smoke

  no  yes 
2067   44 

$calories_monitoring

  no  yes 
2015   96 

$alcohol_frequency

    Always Frequently         no  Sometimes 
         1         70        639       1401 

$transport_means

           Automobile                  Bike             Motorbike 
                  457                     7                    11 
Public_Transportation               Walking 
                 1580                    56 

$obesity_level

Insufficient_Weight       Normal_Weight      Obesity_Type_I     Obesity_Type_II 
                272                 287                 351                 297 
   Obesity_Type_III  Overweight_Level_I Overweight_Level_II 
                324                 290                 290

After the conversion, the frequency tables above show each categorical column with its levels and counts. At this point the data is consistent and readable. We confirm once more that there are no missing values and then proceed to EDA.
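The tables above list levels such as Always, Frequently, no, Sometimes in alphabetical order, because as.factor() does not know that these answers have a natural ranking. The sketch below shows, on a toy vector, one way to declare that ranking explicitly. The vector snacking and the chosen level order are assumptions made for illustration, not part of the original analysis.

```r
# Toy vector using the same labels as snacking_between_meals
snacking <- c("Sometimes", "no", "Always", "Frequently", "Sometimes")

# as.factor() sorts the levels alphabetically (the exact order is locale-dependent)
default_levels <- levels(as.factor(snacking))
default_levels

# Explicit levels give an order from least to most frequent
snacking_ordered <- factor(
  snacking,
  levels = c("no", "Sometimes", "Frequently", "Always"),
  ordered = TRUE
)
levels(snacking_ordered)

# Ordered factors support comparisons
snacking_ordered < "Frequently"
```

Note that R codes ordered factors with polynomial contrasts by default in model formulas, so whether to order a factor before modelling is a design choice rather than a pure cleanup step.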

Check for missing values

colSums(is.na(obesity_data))

                        gender                            age 
                             0                              0 
                        height                         weight 
                             0                              0 
family_history_with_overweight                  high_cal_food 
                             0                              0 
         vegetable_consumption              main_meals_number 
                             0                              0 
        snacking_between_meals                          smoke 
                             0                              0 
                  water_intake            calories_monitoring 
                             0                              0 
             physical_activity                    screen_time 
                             0                              0 
             alcohol_frequency                transport_means 
                             0                              0 
                 obesity_level 
                             0

There are no missing data.

Exploratory Data Analysis

To capture the patterns and relationships in our data we carry out exploratory data analysis. We use one-way ANOVA to compare each numeric variable across the levels of the target variable, and the chi-square test of independence to assess the relationship between each categorical variable and the target. We then see which variables show the strongest association with obesity level.

Univariate analysis of numeric variables

obesity_num_vars<- c("age", "height", "weight", "vegetable_consumption",
              "main_meals_number", "water_intake", 
              "physical_activity", "screen_time")
# Histograms to visualize distribution
obesity_data %>%
  select(all_of(obesity_num_vars)) %>%
  # pivot longer so ggplot2 can handle many variables in one plot
  pivot_longer(cols = everything(), names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = value, fill = variable)) +
  geom_histogram(bins = 30, alpha = 0.7, color = "black") +
  # separate scale for each variable
  facet_wrap(~variable, scales = "free", ncol = 3) +
  theme_minimal() +
  labs(title = "Distribution of Numerical Variables")

From the univariate analysis of each numerical variable we can draw some conclusions about each variable.

Age – The histogram is right-skewed: most participants are relatively young, with a long tail of older participants.

Height – Height is approximately normally distributed, with most participants close to the average height.

Main meals number – An overwhelming majority of participants eat 3 main meals a day; smaller groups report 1 or 4 meals.

Physical activity – Most participants exercise 0 to 1 hours per week, with a smaller group exercising 2 to 3 hours per week.

Screen time – Most participants report few hours of screen time.

Vegetable consumption – The histogram is left-skewed, meaning most participants report frequent vegetable consumption.

Water intake – Water consumption ranges from 1 to 3 liters per day, with 2 liters the most commonly reported value.

Weight – Weight is roughly normally distributed but widely spread (39 kg to 173 kg in the summary statistics), and a sizeable group of participants weighs more than 100 kg.
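The skewness statements above are read off the histograms. As a complement, the sketch below computes a simple moment-based skewness coefficient in base R. The helper skewness() and the simulated vectors are assumptions introduced for illustration; the commented line at the end shows how it could be applied to the numeric columns used in this analysis.

```r
# Moment-based sample skewness: positive = right tail, negative = left tail
skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / (mean(centered^2))^1.5
}

set.seed(1)                    # arbitrary seed so the simulated example is repeatable
right_skewed <- rexp(1000)     # simulated stand-in for a right-skewed variable
symmetric <- c(-2, -1, 0, 1, 2)

skewness(right_skewed)         # positive for a right-skewed sample
skewness(symmetric)            # zero for a perfectly symmetric sample

# With the real data one could run, for example:
# sapply(obesity_data[obesity_num_vars], skewness)
```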

Univariate analysis of categorical variables

obesity_factors <- obesity_data %>%
  select(where(is.factor)) %>%
  select(-obesity_level) %>%
  names()
obesity_data %>%
  pivot_longer(cols = all_of(obesity_factors),
               names_to = "variable",
               values_to = "value") %>%
  ggplot(aes(x = value, fill = value)) +
  geom_bar(show.legend = FALSE) +
  facet_wrap(~variable, scales = "free", ncol = 3) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Univariate Distribution of Categorical Variables",
       x = "Category",
       y = "Count")

From the distributions of the categorical variables:

Alcohol consumption – Responses concentrate in two groups, Sometimes and no, indicating that most participants are light drinkers or non-drinkers.

Calories monitoring – Responses are heavily skewed towards no: almost all participants do not monitor their calorie intake, which is expected, as few people typically do.

Family history with overweight – An overwhelming majority answered yes, meaning most participants have at least one close family member who is overweight. Because this is self-reported, it may introduce bias.

Gender – Males and females are almost equally represented, which matters because gender is a key variable.

High calorie food – Most participants report frequently eating high-calorie food, while a minority report that they do not.

Obesity levels – As seen earlier, the classes are fairly evenly distributed, with Obesity Type I the largest.

Smoking – Almost all participants are non-smokers (44 smokers out of 2,111), so the data may offer only limited insight into how smoking relates to obesity levels.

Snacking between meals – As expected, most participants say they snack between meals at least sometimes. Whether snacking helps predict a person's obesity level is examined in the modelling stage.

Transport means – Almost all participants use either public transportation or an automobile, which probably involve similar levels of physical movement. A small group walks, and it will be interesting to see how that relates to obesity level.

Numeric variables vs Obesity level

# ANOVA for each numeric predictor
num_results <- map_df(obesity_num_vars, function(var) {
  model <- aov(as.formula(paste(var, "~ obesity_level")), data=obesity_data)
  tidy <- broom::tidy(model)
  data.frame(
    Variable = var,
    p_value = tidy$p.value[1]
  )
})

num_results

               Variable       p_value
1                   age  3.592580e-88
2                height  1.685854e-44
3                weight  0.000000e+00
4 vegetable_consumption 3.732469e-123
5     main_meals_number  6.258632e-31
6          water_intake  2.837324e-18
7     physical_activity  7.653253e-20
8           screen_time  2.068782e-08

We applied one-way ANOVA to test whether the mean of each numeric predictor differs across obesity categories. The null hypothesis states that the means are equal across groups, while the alternative states that at least one group mean differs. All numeric variables (age, height, weight, vegetable consumption, main meals, water intake, physical activity, and screen time) show highly significant differences across obesity levels (all p < 0.001). The p-value of 0 reported for weight is a floating-point underflow: the true value is smaller than R can represent, not exactly zero. These results indicate that the characteristics and behaviors differ systematically across obesity categories, although with 2,111 observations a small p-value alone does not tell us how large the differences are.
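Because p-values shrink as the sample grows, an effect size is a useful companion to the ANOVA table. The sketch below computes eta squared (the share of total variance explained by the grouping factor) from the sums of squares of an aov() fit. It uses the built-in iris data as a stand-in, so the numbers it prints say nothing about the obesity data; the same pattern could be applied inside the map_df() loop above.

```r
# One-way ANOVA on a built-in dataset (stand-in for a numeric predictor vs obesity_level)
fit <- aov(Sepal.Length ~ Species, data = iris)

# Sums of squares: first entry is between groups, second is the residual
sum_sq <- summary(fit)[[1]][["Sum Sq"]]

# Eta squared = between-group sum of squares / total sum of squares
eta_sq <- sum_sq[1] / sum(sum_sq)
eta_sq
```

Values near 0 mean the groups barely differ relative to the overall spread, while values near 1 mean group membership explains most of the variation.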

Categorical vs target variables

# Chi-square tests
cat_results <- map_df(obesity_factors, function(var) {
  tbl <- table(obesity_data[[var]], obesity_data$obesity_level)
  test <- chisq.test(tbl)
  data.frame(
    Variable = var,
    p_value = test$p.value
  )
})

Warning in chisq.test(tbl): Chi-squared approximation may be incorrect
Warning in chisq.test(tbl): Chi-squared approximation may be incorrect

cat_results

                        Variable       p_value
1                         gender 8.088897e-139
2 family_history_with_overweight 4.228017e-131
3                  high_cal_food  1.482236e-47
4         snacking_between_meals 7.383853e-159
5                          smoke  1.535424e-05
6            calories_monitoring  3.773176e-24
7              alcohol_frequency  5.287158e-61
8                transport_means  5.177915e-48

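Chi-square tests show that all categorical variables are significantly associated with obesity level (all p < 0.05). This means factors like gender, family history, snacking, calorie monitoring, alcohol use, and transport choices are not independent of obesity outcomes, with snacking, gender and family history producing the smallest p-values. The two warnings indicate that some cells have very small expected counts, which is plausible given that only 1 participant reports Always drinking alcohol and only 7 use a bike, so the p-values for the affected variables are approximate. Overall, this suggests that lifestyle and demographic features contain valuable predictive information, which justifies proceeding to supervised modelling.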
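The "Chi-squared approximation may be incorrect" warning can be investigated directly. The sketch below, on a hypothetical 3 x 2 table with one sparse row, shows how to inspect expected counts, how to request a Monte Carlo p-value through the simulate.p.value argument of chisq.test(), and how to compute Cramér's V as an effect size. The table tbl and all names in it are made up for illustration.

```r
# Hypothetical 3 x 2 contingency table with one sparse row
tbl <- matrix(c(30, 25, 2,
                28, 35, 1),
              nrow = 3,
              dimnames = list(level = c("a", "b", "rare"),
                              outcome = c("x", "y")))

# The asymptotic test warns when expected counts are small
asym <- suppressWarnings(chisq.test(tbl))
asym$expected                 # inspect which cells fall below about 5

# Monte Carlo p-value, which does not rely on the chi-squared approximation
set.seed(1)                   # arbitrary seed for a repeatable simulation
sim <- chisq.test(tbl, simulate.p.value = TRUE, B = 2000)
sim$p.value

# Cramer's V: effect size between 0 (no association) and 1
n <- sum(tbl)
cramers_v <- sqrt(unname(asym$statistic) / (n * (min(dim(tbl)) - 1)))
cramers_v
```

Reporting Cramér's V alongside the p-value gives a comparable measure of association strength across variables with different numbers of levels.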

Modelling

Train test split

We divided the dataset into training and testing sets to ensure fair model evaluation. The training set is used to learn patterns, while the testing set provides an unbiased measure of performance on unseen data. The training set takes 80% of the data and the test set 20%, as set by p = 0.8 in the code below.

library(caret)

Warning: package 'caret' was built under R version 4.4.3

Loading required package: lattice


Attaching package: 'caret'

The following object is masked from 'package:purrr':

    lift

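# Stratified 80/20 split on the target. No seed is set in the code shown, so the
# exact rows selected change between runs; call set.seed() first to make it repeatable.
obesity_index <- createDataPartition(obesity_data$obesity_level, p = 0.8, list = FALSE)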
obesity_train<-obesity_data[obesity_index,]
obesity_test<-obesity_data[-obesity_index,]
# Check class distribution in training and testing sets
prop.table(table(obesity_train$obesity_level))


Insufficient_Weight       Normal_Weight      Obesity_Type_I     Obesity_Type_II 
          0.1289178           0.1360142           0.1661739           0.1407451 
   Obesity_Type_III  Overweight_Level_I Overweight_Level_II 
          0.1537552           0.1371969           0.1371969

prop.table(table(obesity_test$obesity_level))


Insufficient_Weight       Normal_Weight      Obesity_Type_I     Obesity_Type_II 
          0.1285714           0.1357143           0.1666667           0.1404762 
   Obesity_Type_III  Overweight_Level_I Overweight_Level_II 
          0.1523810           0.1380952           0.1380952

The split preserved class balance and in both sets all target levels are represented well.
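To make clear why the class proportions match so closely in both sets, the sketch below reproduces the idea of a stratified split in base R: sample the same fraction of rows within each class. It is an illustration on toy labels with an arbitrary seed, not a reimplementation of caret's exact procedure.

```r
set.seed(123)  # arbitrary seed, chosen only for this illustration

# Toy labels standing in for obesity_level
labels <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Sample 80% of the row indices within each class
train_idx <- unlist(lapply(split(seq_along(labels), labels), function(idx) {
  idx[sample.int(length(idx), size = round(0.8 * length(idx)))]
}))
test_idx <- setdiff(seq_along(labels), train_idx)

# Class proportions are preserved in both parts
prop.table(table(labels[train_idx]))
prop.table(table(labels[test_idx]))
```

Setting a seed before the split, as done here, also makes the partition and every downstream result repeatable.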

Logistic regression

Our baseline model is multinomial logistic regression because the target has more than two classes. The model assumes that the log-odds of each obesity level, relative to the reference level (Insufficient_Weight), are a linear function of the predictors.

library(nnet)

Warning: package 'nnet' was built under R version 4.4.3

obesity_multinom<- multinom(obesity_level~., data = obesity_train)

# weights:  175 (144 variable)
initial  value 3290.534062 
iter  10 value 2696.220768
iter  20 value 1786.454136
iter  30 value 1345.128955
iter  40 value 1113.909119
iter  50 value 865.187418
iter  60 value 594.818878
iter  70 value 339.411134
iter  80 value 191.038813
iter  90 value 123.900039
iter 100 value 93.016927
final  value 93.016927 
stopped after 100 iterations

summary(obesity_multinom)

Warning in sqrt(diag(vc)): NaNs produced

Call:
multinom(formula = obesity_level ~ ., data = obesity_train)

Coefficients:
                    (Intercept) genderMale       age    height    weight
Normal_Weight          104.1123  -4.278847 0.4991415 -129.3605  3.308731
Obesity_Type_I         210.5711 -30.709864 1.1381531 -673.2513 11.434381
Obesity_Type_II        102.0568  41.334800 4.3494712 -911.7491 16.032531
Obesity_Type_III      -123.4957 -94.993682 1.7990186 -636.5243 14.038797
Overweight_Level_I     111.7723 -11.222102 0.6333993 -309.6478  5.859537
Overweight_Level_II    145.4530 -10.312374 1.0157687 -452.7254  7.827319
                    family_history_with_overweightyes high_cal_foodyes
Normal_Weight                              -2.6047075       -1.8657986
Obesity_Type_I                             -3.1191102       -7.5481436
Obesity_Type_II                           -41.2518988      -28.0742199
Obesity_Type_III                          -54.1964366      -22.0588749
Overweight_Level_I                         -2.5655913        0.2336188
Overweight_Level_II                         0.9962695       -4.4791510
                    vegetable_consumption main_meals_number
Normal_Weight                   -3.127820        -1.4281905
Obesity_Type_I                  -5.271185        -0.1060674
Obesity_Type_II                -12.201466        11.6580018
Obesity_Type_III                45.691937        13.3084770
Overweight_Level_I              -4.486386        -1.1692326
Overweight_Level_II             -4.800903        -2.2046927
                    snacking_between_mealsFrequently snacking_between_mealsno
Normal_Weight                              -2.939054               6.00810121
Obesity_Type_I                             32.439242               0.06834716
Obesity_Type_II                          -131.556768              56.58320985
Obesity_Type_III                           96.722104              51.35349705
Overweight_Level_I                         -1.288679              13.48201232
Overweight_Level_II                         4.561869            -292.29707844
                    snacking_between_mealsSometimes smokeyes water_intake
Normal_Weight                              2.198705 75.39868    -4.439285
Obesity_Type_I                            51.736557 73.19997    -3.990869
Obesity_Type_II                           39.975907 74.29580   -22.784351
Obesity_Type_III                          60.234403 45.60579   -13.946177
Overweight_Level_I                         9.081797 70.88715    -4.917914
Overweight_Level_II                       17.412384 77.40542    -4.534858
                    calories_monitoringyes physical_activity screen_time
Normal_Weight                     2.457481         -1.097591  -0.6396165
Obesity_Type_I                  -22.764238         -5.921336   1.9877657
Obesity_Type_II                -101.801108        -13.322538   3.4984150
Obesity_Type_III                -27.776178        -37.819124  15.9342576
Overweight_Level_I                7.318967         -2.478605   0.1790756
Overweight_Level_II              -1.201158         -3.609689   2.7910223
                    alcohol_frequencyFrequently alcohol_frequencyno
Normal_Weight                       -51.0763638           -48.00484
Obesity_Type_I                       80.5468440            72.99744
Obesity_Type_II                      23.5668446            53.54759
Obesity_Type_III                      0.1943684           -79.78410
Overweight_Level_I                   62.3486600            64.97161
Overweight_Level_II                 103.5110480           103.99548
                    alcohol_frequencySometimes transport_meansBike
Normal_Weight                        -49.61395           110.28588
Obesity_Type_I                        70.08204           -50.30027
Obesity_Type_II                       17.94064            55.93853
Obesity_Type_III                     -59.43774            89.18470
Overweight_Level_I                    63.90688           114.38589
Overweight_Level_II                   99.59264          -157.33314
                    transport_meansMotorbike
Normal_Weight                       93.63745
Obesity_Type_I                     128.93915
Obesity_Type_II                     17.05583
Obesity_Type_III                    77.61370
Overweight_Level_I                -148.60028
Overweight_Level_II                103.31376
                    transport_meansPublic_Transportation transport_meansWalking
Normal_Weight                                   3.492772             -0.1235635
Obesity_Type_I                                 11.226661             -7.0806717
Obesity_Type_II                                28.796208             52.2786915
Obesity_Type_III                               34.240726             52.2467659
Overweight_Level_I                              3.334457             -1.0089316
Overweight_Level_II                             7.351153             -5.7946332

Std. Errors:
                    (Intercept)   genderMale       age   height    weight
Normal_Weight         5.7584244 2.0321196370 0.1297105 3.525876 0.1999559
Obesity_Type_I        6.3850450 3.5086848643 0.2333727 5.839728 0.2504084
Obesity_Type_II       1.6034950 8.4476100785 0.5167610 3.754929 0.2588414
Obesity_Type_III      0.8796131 0.0001534714 2.1979180 1.863654 0.5110704
Overweight_Level_I    4.9489838 2.5778526919 0.1652182 3.788064 0.2234592
Overweight_Level_II   5.6855432 2.7751143290 0.1826088 3.446646 0.2093502
                    family_history_with_overweightyes high_cal_foodyes
Normal_Weight                                1.746925        3.2549955
Obesity_Type_I                               3.085801        4.4329660
Obesity_Type_II                              9.161301       10.8911187
Obesity_Type_III                             0.877410        0.6700328
Overweight_Level_I                           2.032227        3.4737893
Overweight_Level_II                          2.370496        3.7356907
                    vegetable_consumption main_meals_number
Normal_Weight                    1.454700          1.016843
Obesity_Type_I                   2.727545          1.577503
Obesity_Type_II                  4.696327          4.534105
Obesity_Type_III                 1.050994          2.640908
Overweight_Level_I               1.744764          1.194525
Overweight_Level_II              2.038112          1.349582
                    snacking_between_mealsFrequently snacking_between_mealsno
Normal_Weight                           2.712062e+00             2.272892e+00
Obesity_Type_I                          4.284903e+00             2.700606e-01
Obesity_Type_II                         1.301948e-13             2.700606e-01
Obesity_Type_III                        3.676033e+00                      NaN
Overweight_Level_I                      3.460828e+00             2.397051e+00
Overweight_Level_II                     4.016013e+00            3.275319e-125
                    snacking_between_mealsSometimes   smokeyes water_intake
Normal_Weight                              2.740806 3.04097078     1.321772
Obesity_Type_I                             3.178326 4.02921346     1.989293
Obesity_Type_II                            5.825934 0.33492487     4.009075
Obesity_Type_III                           2.800190 0.02163595     1.542118
Overweight_Level_I                         3.297938 2.15949243     1.614487
Overweight_Level_II                        3.284437 2.20578727     1.761118
                    calories_monitoringyes physical_activity screen_time
Normal_Weight                    6.1716255         0.5702395    1.041604
Obesity_Type_I                   0.2788397         1.1625524    2.011484
Obesity_Type_II                  0.2788397         4.4854013    3.359948
Obesity_Type_III                 3.6760123         1.2638821    7.859370
Overweight_Level_I               6.4329888         0.8390054    1.330410
Overweight_Level_II             10.2885490         0.9499527    1.536210
                    alcohol_frequencyFrequently alcohol_frequencyno
Normal_Weight                         1.9768810            2.727852
Obesity_Type_I                        2.2933192            3.101476
Obesity_Type_II                       0.9410388            3.431810
Obesity_Type_III                      3.6763606            3.458693
Overweight_Level_I                    1.8955185            2.472954
Overweight_Level_II                   1.9309120            2.703426
                    alcohol_frequencySometimes transport_meansBike
Normal_Weight                        2.6867910        1.334655e+00
Obesity_Type_I                       3.1780584        1.012594e-13
Obesity_Type_II                      3.9776024        1.466677e-15
Obesity_Type_III                     0.6698108        1.330867e-16
Overweight_Level_I                   2.5181176        1.334655e+00
Overweight_Level_II                  2.8404125       6.277023e-123
                    transport_meansMotorbike
Normal_Weight                   5.724671e-06
Obesity_Type_I                  7.480840e-08
Obesity_Type_II                 1.793157e-14
Obesity_Type_III                         NaN
Overweight_Level_I             8.480217e-104
Overweight_Level_II             5.747207e-06
                    transport_meansPublic_Transportation transport_meansWalking
Normal_Weight                                   1.347745           2.2907137477
Obesity_Type_I                                  3.699480           2.9685964103
Obesity_Type_II                                 6.693485           0.0364574428
Obesity_Type_III                                4.336176           0.0001339752
Overweight_Level_I                              2.169722           1.6734734946
Overweight_Level_II                             2.710172           1.6497269184

Residual Deviance: 186.0339 
AIC: 474.0339

obesity_pred<- predict(obesity_multinom, newdata = obesity_test) #prediction
table(predicted= obesity_pred, actual=obesity_test$obesity_level)# confusion matrix

                     actual
predicted             Insufficient_Weight Normal_Weight Obesity_Type_I
  Insufficient_Weight                  51             6              0
  Normal_Weight                         3            51              0
  Obesity_Type_I                        0             0             66
  Obesity_Type_II                       0             0              1
  Obesity_Type_III                      0             0              0
  Overweight_Level_I                    0             0              0
  Overweight_Level_II                   0             0              3
                     actual
predicted             Obesity_Type_II Obesity_Type_III Overweight_Level_I
  Insufficient_Weight               0                0                  0
  Normal_Weight                     0                0                  1
  Obesity_Type_I                    3                0                  0
  Obesity_Type_II                  55                1                  0
  Obesity_Type_III                  1               63                  0
  Overweight_Level_I                0                0                 54
  Overweight_Level_II               0                0                  3
                     actual
predicted             Overweight_Level_II
  Insufficient_Weight                   0
  Normal_Weight                         0
  Obesity_Type_I                        1
  Obesity_Type_II                       1
  Obesity_Type_III                      1
  Overweight_Level_I                    6
  Overweight_Level_II                  49

mean(obesity_pred==obesity_test$obesity_level)#classification accuracy

[1] 0.9261905

The multinomial model classified the extreme classes well: 51 of 54 Insufficient Weight cases and 63 of 64 Obesity Type III cases were predicted correctly. Most misclassifications fall between adjacent classes, such as Insufficient Weight vs Normal Weight and Overweight Level I vs Overweight Level II. Test accuracy is 92.6%, meaning the model correctly classifies about 93 of every 100 test cases. We can conclude that the model is dependable in distinguishing between obesity levels.
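As a cross-check on the figures quoted above, the following sketch rebuilds the multinomial confusion matrix from the printed table and derives overall accuracy plus per-class recall and precision. The object names (`cm`, `recall`, `precision`) are illustrative and not part of the original analysis.

```r
# Confusion matrix copied from the printed output (rows = predicted, columns = actual)
levels_obesity <- c("Insufficient_Weight", "Normal_Weight", "Obesity_Type_I",
                    "Obesity_Type_II", "Obesity_Type_III",
                    "Overweight_Level_I", "Overweight_Level_II")
cm <- matrix(c(51,  6,  0,  0,  0,  0,  0,
                3, 51,  0,  0,  0,  1,  0,
                0,  0, 66,  3,  0,  0,  1,
                0,  0,  1, 55,  1,  0,  1,
                0,  0,  0,  1, 63,  0,  1,
                0,  0,  0,  0,  0, 54,  6,
                0,  0,  3,  0,  0,  3, 49),
             nrow = 7, byrow = TRUE,
             dimnames = list(predicted = levels_obesity, actual = levels_obesity))

accuracy  <- sum(diag(cm)) / sum(cm)   # share of all test cases on the diagonal
recall    <- diag(cm) / colSums(cm)    # correct / actual cases, per class
precision <- diag(cm) / rowSums(cm)    # correct / predicted cases, per class

round(accuracy, 4)
round(cbind(recall, precision), 3)
```

Reading recall and precision per class makes it easier to see which obesity levels drive the errors than the overall accuracy alone.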

Decision tree

A decision tree offers more interpretability, captures non-linear relationships, and provides a measure of feature importance.

library(rpart)

Warning: package 'rpart' was built under R version 4.4.3

library(rpart.plot)

Warning: package 'rpart.plot' was built under R version 4.4.3

obesity_tree <- rpart(obesity_level ~ ., data = obesity_train, method = "class") # fit decision tree model
rpart.plot(obesity_tree, type = 4, extra = 104, under = TRUE, faclen = 3)

Warning: All boxes will be white (the box.palette argument will be ignored) because
the number of classes in the response 7 is greater than length(box.palette) 6.
To silence this warning use box.palette=0 or trace=-1.

#prediction
obesity_pred2<-predict(obesity_tree, newdata = obesity_test,type = "class")
#confusion  matrix
table(predicted=obesity_pred2,actual=obesity_test$obesity_level)

                     actual
predicted             Insufficient_Weight Normal_Weight Obesity_Type_I
  Insufficient_Weight                  51             9              0
  Normal_Weight                         3            38              0
  Obesity_Type_I                        0             0             60
  Obesity_Type_II                       0             0              0
  Obesity_Type_III                      0             0              0
  Overweight_Level_I                    0             9              3
  Overweight_Level_II                   0             1              7
                     actual
predicted             Obesity_Type_II Obesity_Type_III Overweight_Level_I
  Insufficient_Weight               0                0                  0
  Normal_Weight                     0                0                  5
  Obesity_Type_I                    8                0                  0
  Obesity_Type_II                  50                0                  0
  Obesity_Type_III                  1               64                  0
  Overweight_Level_I                0                0                 49
  Overweight_Level_II               0                0                  4
                     actual
predicted             Overweight_Level_II
  Insufficient_Weight                   0
  Normal_Weight                         1
  Obesity_Type_I                        2
  Obesity_Type_II                       0
  Obesity_Type_III                      0
  Overweight_Level_I                    1
  Overweight_Level_II                  54

#accuracy test
mean(obesity_pred2==obesity_test$obesity_level)

[1] 0.8714286

#variable importance
obesity_tree$variable.importance

                        weight                         height 
                   732.9146993                    506.4939233 
                        gender          vegetable_consumption 
                   289.5898971                    263.5182519 
                           age              physical_activity 
                   240.9849001                    122.5034810 
             main_meals_number              alcohol_frequency 
                   109.2679343                     72.8925605 
        snacking_between_meals                   water_intake 
                    64.0684837                     62.0866234 
family_history_with_overweight                  high_cal_food 
                    53.0837527                     41.8834074 
                   screen_time                transport_means 
                    32.8847711                      0.7713107

#variable importance plot
variable_importance<-sort(obesity_tree$variable.importance,decreasing = TRUE)
barplot(variable_importance,main = "Variable importance in obesity data",las = 2, cex.names = 0.8,col = blues9)

The only warning shown above comes from rpart.plot and concerns box colors (the response has 7 classes but the default palette has 6), not overfitting, so we proceeded to check how the tree performs. It reached a test accuracy of 87.1%, lower than multinomial logistic regression (92.6%), mainly because of more confusion between Normal Weight and its neighboring classes: of 57 actual Normal Weight cases, 9 were predicted as Insufficient Weight and 9 as Overweight Level I. According to variable importance, weight and height had the biggest influence on the splits, while screen time and means of transport had the least.
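Whether a single tree overfits can be checked from rpart's cross-validated complexity table. The sketch below uses the built-in `iris` data as a stand-in (an assumption, since `obesity_train` is not reproduced here); the same calls should apply to `obesity_tree`. It grows a deliberately deep tree, picks the `cp` value with the lowest cross-validated error and prunes to it. The control settings are arbitrary choices for illustration.

```r
library(rpart)

set.seed(66)  # rpart's internal cross-validation is random
# Grow a deliberately large tree (small cp, small minsplit) on stand-in data
tree_full <- rpart(Species ~ ., data = iris, method = "class",
                   control = rpart.control(cp = 0.001, minsplit = 5))

# cptable holds, per tree size, the cross-validated error (xerror)
cp_table <- tree_full$cptable
best_cp  <- cp_table[which.min(cp_table[, "xerror"]), "CP"]

# Prune back to the complexity with the lowest cross-validated error
tree_pruned <- prune(tree_full, cp = best_cp)

printcp(tree_full)
c(full = nrow(tree_full$frame), pruned = nrow(tree_pruned$frame))  # node counts
```

If the pruned tree is much smaller than the full one while test accuracy stays about the same, the extra splits were mostly fitting noise.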

Random Forests

A random forest is more versatile than a single decision tree: by combining many decision trees it reduces model variance, which lowers the risk of overfitting and usually gives higher accuracy.

set.seed(66)
library(randomForest)

Warning: package 'randomForest' was built under R version 4.4.3

randomForest 4.7-1.2

Type rfNews() to see new features/changes/bug fixes.


Attaching package: 'randomForest'

The following object is masked from 'package:dplyr':

    combine

The following object is masked from 'package:ggplot2':

    margin

obesity_forest<-randomForest(obesity_level~.,obesity_train, ntree=500,mtry=3,importance=TRUE)#random forest with 500 trees
print(obesity_forest)


Call:
 randomForest(formula = obesity_level ~ ., data = obesity_train,      ntree = 500, mtry = 3, importance = TRUE) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 3

        OOB estimate of  error rate: 4.49%
Confusion matrix:
                    Insufficient_Weight Normal_Weight Obesity_Type_I
Insufficient_Weight                 207            11              0
Normal_Weight                         2           221              0
Obesity_Type_I                        0             3            272
Obesity_Type_II                       0             1              1
Obesity_Type_III                      0             1              0
Overweight_Level_I                    0            24              1
Overweight_Level_II                   0             8              2
                    Obesity_Type_II Obesity_Type_III Overweight_Level_I
Insufficient_Weight               0                0                  0
Normal_Weight                     0                0                  3
Obesity_Type_I                    0                0                  1
Obesity_Type_II                 236                0                  0
Obesity_Type_III                  0              259                  0
Overweight_Level_I                0                0                201
Overweight_Level_II               0                0                  3
                    Overweight_Level_II class.error
Insufficient_Weight                   0 0.050458716
Normal_Weight                         4 0.039130435
Obesity_Type_I                        5 0.032028470
Obesity_Type_II                       0 0.008403361
Obesity_Type_III                      0 0.003846154
Overweight_Level_I                    6 0.133620690
Overweight_Level_II                 219 0.056034483

#variable importance
importance(obesity_forest)

                               Insufficient_Weight Normal_Weight Obesity_Type_I
gender                                   18.584227    11.9499659      28.920042
age                                      37.940567     7.5368000      41.154752
height                                   30.187397    19.2803844      40.785216
weight                                   72.689662    56.9798854      71.757536
family_history_with_overweight           27.578458     4.4307175      26.339004
high_cal_food                            12.360389    -1.4777621      16.516865
vegetable_consumption                    22.999228    14.5804221      35.163607
main_meals_number                        26.185780     6.6540364      30.065424
snacking_between_meals                   28.927954    15.4657580      20.631067
smoke                                     4.813657    -0.7137195       1.520536
water_intake                             18.709977    18.0248212      27.228552
calories_monitoring                       4.726064    -2.4370918       7.347807
physical_activity                        21.042054    10.0616051      24.978822
screen_time                              25.250656    12.9705780      25.177594
alcohol_frequency                        21.625143    -1.6121426      27.081465
transport_means                          21.270918     1.2233693      18.523320
                               Obesity_Type_II Obesity_Type_III
gender                               26.556773        34.430020
age                                  36.156543        17.127462
height                               19.682894        14.554765
weight                               86.795677        57.744329
family_history_with_overweight       15.538555        15.845010
high_cal_food                         7.103130         8.143753
vegetable_consumption                26.584662        27.765556
main_meals_number                    24.024529        15.737880
snacking_between_meals               14.755363        14.517081
smoke                                 1.752913         2.586652
water_intake                         24.205783        10.054779
calories_monitoring                   4.127308         3.728301
physical_activity                    20.678003        12.101865
screen_time                          16.775447        15.585637
alcohol_frequency                    19.265735        15.097987
transport_means                      10.414462        12.282992
                               Overweight_Level_I Overweight_Level_II
gender                                  21.439257           25.170943
age                                     38.515125           43.209371
height                                  35.861622           41.831496
weight                                  58.625319           56.816324
family_history_with_overweight          21.416759           23.315480
high_cal_food                           14.003357           26.914512
vegetable_consumption                   25.045881           27.004729
main_meals_number                       25.835600           25.698955
snacking_between_meals                  21.855758           17.747697
smoke                                    2.321436            1.253901
water_intake                            21.337982           23.317667
calories_monitoring                     14.028463            7.192082
physical_activity                       22.942599           20.773299
screen_time                             21.571321           23.572985
alcohol_frequency                       26.171887           28.542572
transport_means                         18.193353           18.268737
                               MeanDecreaseAccuracy MeanDecreaseGini
gender                                    37.154206        79.463865
age                                       47.936099       143.277224
height                                    48.000157       132.811274
weight                                    95.271588       438.967826
family_history_with_overweight            30.140734        42.522979
high_cal_food                             25.854908        21.494648
vegetable_consumption                     36.714298       126.191113
main_meals_number                         35.798921        76.820863
snacking_between_meals                    30.348916        48.743945
smoke                                      4.827222         3.356557
water_intake                              41.039716        67.422554
calories_monitoring                       14.732862         8.486069
physical_activity                         34.889973        70.721554
screen_time                               30.720330        69.343741
alcohol_frequency                         31.601517        47.335936
transport_means                           26.519560        32.961566

varImpPlot(obesity_forest,main = "Variable importance random forests")

#prediction
obesity_pred3<-predict(obesity_forest,newdata = obesity_test)
#confusion matrix
table(predicted=obesity_pred3,actual=obesity_test$obesity_level)

                     actual
predicted             Insufficient_Weight Normal_Weight Obesity_Type_I
  Insufficient_Weight                  48             0              0
  Normal_Weight                         6            56              0
  Obesity_Type_I                        0             0             64
  Obesity_Type_II                       0             0              0
  Obesity_Type_III                      0             0              0
  Overweight_Level_I                    0             1              2
  Overweight_Level_II                   0             0              4
                     actual
predicted             Obesity_Type_II Obesity_Type_III Overweight_Level_I
  Insufficient_Weight               0                0                  0
  Normal_Weight                     0                0                  4
  Obesity_Type_I                    1                0                  0
  Obesity_Type_II                  57                0                  0
  Obesity_Type_III                  1               64                  0
  Overweight_Level_I                0                0                 53
  Overweight_Level_II               0                0                  1
                     actual
predicted             Overweight_Level_II
  Insufficient_Weight                   0
  Normal_Weight                         5
  Obesity_Type_I                        0
  Obesity_Type_II                       0
  Obesity_Type_III                      0
  Overweight_Level_I                    4
  Overweight_Level_II                  49

#accuracy test
mean(obesity_pred3==obesity_test$obesity_level)

[1] 0.9309524

The random forest achieved an OOB error of 4.49%, a very low error rate. This is consistent with the test accuracy of 93.1%, which demonstrates reliable classification of obesity level. Most of the remaining errors involve Normal Weight and its neighboring classes (Insufficient Weight, Overweight Level I and Overweight Level II). From feature importance, the most influential features for predicting obesity level were:

weight (MeanDecreaseAccuracy: 95.27)
height (MeanDecreaseAccuracy: 48.00)
age (MeanDecreaseAccuracy: 47.94)

Among behavioral factors, water intake (MeanDecreaseAccuracy: 41.04) and vegetable consumption (MeanDecreaseAccuracy: 36.71) were the most influential.
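The OOB figures can be reproduced from the printed confusion matrix. Note that randomForest prints actual classes in rows and predicted classes in columns, the reverse of the `table(predicted, actual)` calls used elsewhere in this report. The sketch below is plain arithmetic on the printed counts; `oob_cm` and the other names are illustrative.

```r
# OOB confusion matrix copied from print(obesity_forest)
# (rows = actual class, columns = predicted class)
classes <- c("Insufficient_Weight", "Normal_Weight", "Obesity_Type_I",
             "Obesity_Type_II", "Obesity_Type_III",
             "Overweight_Level_I", "Overweight_Level_II")
oob_cm <- matrix(c(207,  11,   0,   0,   0,   0,   0,
                     2, 221,   0,   0,   0,   3,   4,
                     0,   3, 272,   0,   0,   1,   5,
                     0,   1,   1, 236,   0,   0,   0,
                     0,   1,   0,   0, 259,   0,   0,
                     0,  24,   1,   0,   0, 201,   6,
                     0,   8,   2,   0,   0,   3, 219),
                 nrow = 7, byrow = TRUE,
                 dimnames = list(actual = classes, predicted = classes))

# Overall OOB error: share of training rows misclassified by out-of-bag trees
oob_error <- 1 - sum(diag(oob_cm)) / sum(oob_cm)

# Per-class error: misclassified share within each actual class (row-wise)
class_error <- 1 - diag(oob_cm) / rowSums(oob_cm)

round(100 * oob_error, 2)
round(class_error, 4)
```

The per-class errors show that Overweight Level I is the hardest class for the forest, mostly because of rows predicted as Normal Weight.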

Feature scaling

Categorical variables are encoded numerically as dummy variables, and all predictors are then centered and scaled. Because K-nearest neighbors relies on a distance metric, putting every feature on a comparable scale prevents variables with large ranges (such as weight in kilograms) from having a disproportionate influence on the distance calculation.
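A small sketch of why scaling matters for distance-based methods: it takes the height and weight of three individuals (values that appear in the training data) and compares Euclidean distances before and after standardizing. It uses base R's `scale()` as a stand-in for the caret preprocessing, which is an assumption made to keep the example self-contained.

```r
# Height (metres) and weight (kilograms) for three individuals
people <- data.frame(height = c(1.62, 1.52, 1.80),
                     weight = c(64, 56, 77))

d_raw    <- as.matrix(dist(people))          # Euclidean distance on raw units
d_scaled <- as.matrix(dist(scale(people)))   # after centering and scaling each column

d_raw[1, 2]                                  # raw distance between persons 1 and 2
abs(people$weight[1] - people$weight[2])     # their weight gap alone
d_scaled[1, 2]                               # distance once both columns are comparable
```

On the raw data the distance between the first two people is almost identical to their weight gap, so height contributes next to nothing until the columns are put on a common scale.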

library(class)

Warning: package 'class' was built under R version 4.4.3

library(caret)
# Identify predictors
predictors <- setdiff(names(obesity_train), "obesity_level")

# Create dummy variables for categorical predictors
dummies <- dummyVars(obesity_level ~ ., data= obesity_train)

train_transformed <- data.frame(predict(dummies, newdata = obesity_train))

Warning in model.frame.default(Terms, newdata, na.action = na.action, xlev =
object$lvls): variable 'obesity_level' is not a factor

train_transformed$obesity_level <- obesity_train$obesity_level

test_transformed <- data.frame(predict(dummies, newdata = obesity_test))

Warning in model.frame.default(Terms, newdata, na.action = na.action, xlev =
object$lvls): variable 'obesity_level' is not a factor

test_transformed$obesity_level <- obesity_test$obesity_level

str(train_transformed)

'data.frame':   1691 obs. of  32 variables:
 $ gender.Female                        : num  1 1 0 0 0 1 0 0 0 0 ...
 $ gender.Male                          : num  0 0 1 1 1 0 1 1 1 1 ...
 $ age                                  : num  21 21 23 27 22 23 22 24 22 26 ...
 $ height                               : num  1.62 1.52 1.8 1.8 1.78 1.5 1.64 1.78 1.72 1.85 ...
 $ weight                               : num  64 56 77 87 89.8 55 53 64 68 105 ...
 $ family_history_with_overweight.no    : num  0 0 0 1 1 0 1 0 0 0 ...
 $ family_history_with_overweight.yes   : num  1 1 1 0 0 1 0 1 1 1 ...
 $ high_cal_food.no                     : num  1 1 1 1 1 0 1 0 0 0 ...
 $ high_cal_food.yes                    : num  0 0 0 0 0 1 0 1 1 1 ...
 $ vegetable_consumption                : num  2 3 2 3 2 3 2 3 2 3 ...
 $ main_meals_number                    : num  3 3 3 3 1 3 3 3 3 3 ...
 $ snacking_between_meals.Always        : num  0 0 0 0 0 0 0 0 0 0 ...
 $ snacking_between_meals.Frequently    : num  0 0 0 0 0 0 0 0 0 1 ...
 $ snacking_between_meals.no            : num  0 0 0 0 0 0 0 0 0 0 ...
 $ snacking_between_meals.Sometimes     : num  1 1 1 1 1 1 1 1 1 0 ...
 $ smoke.no                             : num  1 0 1 1 1 1 1 1 1 1 ...
 $ smoke.yes                            : num  0 1 0 0 0 0 0 0 0 0 ...
 $ water_intake                         : num  2 3 2 2 2 2 2 2 2 3 ...
 $ calories_monitoring.no               : num  1 0 1 1 1 1 1 1 1 1 ...
 $ calories_monitoring.yes              : num  0 1 0 0 0 0 0 0 0 0 ...
 $ physical_activity                    : num  0 3 2 2 0 1 3 1 1 2 ...
 $ screen_time                          : num  1 0 1 0 0 0 0 1 1 2 ...
 $ alcohol_frequency.Always             : num  0 0 0 0 0 0 0 0 0 0 ...
 $ alcohol_frequency.Frequently         : num  0 0 1 1 0 0 0 1 0 0 ...
 $ alcohol_frequency.no                 : num  1 0 0 0 0 0 0 0 1 0 ...
 $ alcohol_frequency.Sometimes          : num  0 1 0 0 1 1 1 0 0 1 ...
 $ transport_means.Automobile           : num  0 0 0 0 0 0 0 0 0 0 ...
 $ transport_means.Bike                 : num  0 0 0 0 0 0 0 0 0 0 ...
 $ transport_means.Motorbike            : num  0 0 0 0 0 1 0 0 0 0 ...
 $ transport_means.Public_Transportation: num  1 1 1 0 1 0 1 1 1 1 ...
 $ transport_means.Walking              : num  0 0 0 1 0 0 0 0 0 0 ...
 $ obesity_level                        : Factor w/ 7 levels "Insufficient_Weight",..: 2 2 2 6 7 2 2 2 2 3 ...

# Center and scale all numeric columns (including the dummy variables)
scaled<- preProcess(train_transformed, method = c("center", "scale"))

train_ready <- predict(scaled, train_transformed)
test_ready  <- predict(scaled, test_transformed)
str(train_ready)

'data.frame':   1691 obs. of  32 variables:
 $ gender.Female                        : num  1.015 1.015 -0.985 -0.985 -0.985 ...
 $ gender.Male                          : num  -1.015 -1.015 0.985 0.985 0.985 ...
 $ age                                  : num  -0.515 -0.515 -0.196 0.444 -0.355 ...
 $ height                               : num  -0.892 -1.96 1.029 1.029 0.815 ...
 $ weight                               : num  -0.87008 -1.17482 -0.37488 0.00604 0.1127 ...
 $ family_history_with_overweight.no    : num  -0.454 -0.454 -0.454 2.202 2.202 ...
 $ family_history_with_overweight.yes   : num  0.454 0.454 0.454 -2.202 -2.202 ...
 $ high_cal_food.no                     : num  2.83 2.83 2.83 2.83 2.83 ...
 $ high_cal_food.yes                    : num  -2.83 -2.83 -2.83 -2.83 -2.83 ...
 $ vegetable_consumption                : num  -0.787 1.093 -0.787 1.093 -0.787 ...
 $ main_meals_number                    : num  0.409 0.409 0.409 0.409 -2.156 ...
 $ snacking_between_meals.Always        : num  -0.154 -0.154 -0.154 -0.154 -0.154 ...
 $ snacking_between_meals.Frequently    : num  -0.356 -0.356 -0.356 -0.356 -0.356 ...
 $ snacking_between_meals.no            : num  -0.154 -0.154 -0.154 -0.154 -0.154 ...
 $ snacking_between_meals.Sometimes     : num  0.434 0.434 0.434 0.434 0.434 ...
 $ smoke.no                             : num  0.145 -6.877 0.145 0.145 0.145 ...
 $ smoke.yes                            : num  -0.145 6.877 -0.145 -0.145 -0.145 ...
 $ water_intake                         : num  -0.00813 1.62309 -0.00813 -0.00813 -0.00813 ...
 $ calories_monitoring.no               : num  0.211 -4.741 0.211 0.211 0.211 ...
 $ calories_monitoring.yes              : num  -0.211 4.741 -0.211 -0.211 -0.211 ...
 $ physical_activity                    : num  -1.2 2.33 1.15 1.15 -1.2 ...
 $ screen_time                          : num  0.544 -1.097 0.544 -1.097 -1.097 ...
 $ alcohol_frequency.Always             : num  -0.0243 -0.0243 -0.0243 -0.0243 -0.0243 ...
 $ alcohol_frequency.Frequently         : num  -0.192 -0.192 5.212 5.212 -0.192 ...
 $ alcohol_frequency.no                 : num  1.543 -0.648 -0.648 -0.648 -0.648 ...
 $ alcohol_frequency.Sometimes          : num  -1.419 0.704 -1.419 -1.419 0.704 ...
 $ transport_means.Automobile           : num  -0.517 -0.517 -0.517 -0.517 -0.517 ...
 $ transport_means.Bike                 : num  -0.0544 -0.0544 -0.0544 -0.0544 -0.0544 ...
 $ transport_means.Motorbike            : num  -0.0689 -0.0689 -0.0689 -0.0689 -0.0689 ...
 $ transport_means.Public_Transportation: num  0.572 0.572 0.572 -1.747 0.572 ...
 $ transport_means.Walking              : num  -0.169 -0.169 -0.169 5.913 -0.169 ...
 $ obesity_level                        : Factor w/ 7 levels "Insufficient_Weight",..: 2 2 2 6 7 2 2 2 2 3 ...

# Separate predictors and labels
train_x <- subset(train_ready, select = -obesity_level)
train_y <- train_ready$obesity_level

test_x  <- subset(test_ready, select = -obesity_level)
test_y  <- test_ready$obesity_level

K Nearest Neighbors

K-nearest neighbors works by finding the K training points most similar to a new, unclassified point and, in our case, assigning it the obesity level held by the majority of those K neighbors.
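The majority-vote rule described above can be written out by hand. This is a minimal sketch on made-up toy data, not the implementation used by `class::knn`; the function name `knn_predict_one` is hypothetical, and ties are resolved by taking the first class rather than at random.

```r
# Classify one new point by majority vote among its k nearest training points
knn_predict_one <- function(train_x, train_y, new_x, k = 3) {
  # Euclidean distance from the new point to every training row
  diffs <- sweep(as.matrix(train_x), 2, as.numeric(new_x))
  dists <- sqrt(rowSums(diffs^2))
  nearest <- order(dists)[seq_len(k)]   # indices of the k closest rows
  votes <- table(train_y[nearest])      # class counts among the neighbors
  names(votes)[which.max(votes)]        # most frequent class (first one on ties)
}

# Toy data: two well-separated groups (made-up values)
toy_x <- data.frame(f1 = c(0.0, 0.1, 0.2, 3.0, 3.1, 3.2),
                    f2 = c(0.0, 0.2, 0.1, 3.0, 2.9, 3.1))
toy_y <- factor(c("A", "A", "A", "B", "B", "B"))

pred <- knn_predict_one(toy_x, toy_y, new_x = c(0.15, 0.10), k = 3)
pred
```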

set.seed(66)
obesity_pred4 <- knn(train = train_x,
                test  = test_x,
                cl    = train_y,
                k     = 3) # kNN with k =3

# Confusion matrix
table(predicted = obesity_pred4, actual = test_y)

                     actual
predicted             Insufficient_Weight Normal_Weight Obesity_Type_I
  Insufficient_Weight                  47            11              1
  Normal_Weight                         6            29              4
  Obesity_Type_I                        0             0             55
  Obesity_Type_II                       0             1              0
  Obesity_Type_III                      0             0              0
  Overweight_Level_I                    0             6              2
  Overweight_Level_II                   1            10              8
                     actual
predicted             Obesity_Type_II Obesity_Type_III Overweight_Level_I
  Insufficient_Weight               0                0                  5
  Normal_Weight                     1                0                  3
  Obesity_Type_I                    3                0                  3
  Obesity_Type_II                  54                0                  0
  Obesity_Type_III                  0               64                  0
  Overweight_Level_I                0                0                 46
  Overweight_Level_II               1                0                  1
                     actual
predicted             Overweight_Level_II
  Insufficient_Weight                   1
  Normal_Weight                         6
  Obesity_Type_I                        3
  Obesity_Type_II                       0
  Obesity_Type_III                      0
  Overweight_Level_I                    4
  Overweight_Level_II                  44

# Accuracy
mean(obesity_pred4 == test_y)

[1] 0.8071429

KNN correctly classified 80.7% of test observations, the lowest accuracy of the four models and well below the random forest (93.1%). One plausible explanation is that KNN weights all 31 scaled features equally in the distance calculation, including weakly informative ones, whereas a random forest can concentrate on the most informative features. The KNN model had difficulty separating Insufficient Weight from Normal Weight: it misclassified 6 Insufficient Weight cases as Normal Weight and 11 Normal Weight cases as Insufficient Weight. It classified all 64 Obesity Type III observations correctly. The confusion matrix also suggests the model struggles with boundary levels but performs best at the extreme levels.
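The analysis fixes k at 3. One way to choose k from the data is leave-one-out cross-validation with `class::knn.cv`. The sketch below uses the built-in `iris` data as a stand-in (an assumption, since the prepared `train_x` and `train_y` are not reproduced here), and the candidate grid of k values is arbitrary.

```r
library(class)

set.seed(66)                 # knn.cv breaks distance ties at random
x <- scale(iris[, 1:4])      # standardized numeric predictors (stand-in data)
y <- iris$Species

ks <- c(1, 3, 5, 7, 9, 11)   # candidate neighborhood sizes (arbitrary grid)

# Leave-one-out accuracy for each candidate k
cv_acc <- sapply(ks, function(k) mean(knn.cv(train = x, cl = y, k = k) == y))
names(cv_acc) <- ks

best_k <- ks[which.max(cv_acc)]
round(cv_acc, 3)
best_k
```

Applied to the obesity data, the same loop would run on `train_x` and `train_y` only, so the test set stays untouched until the final evaluation.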

Conclusions and Recommendations

This study developed four statistical models to classify obesity levels into seven distinct categories.
Of the four models, the best performer was the random forest, with 93.1% accuracy on a held-out test set, followed closely by multinomial logistic regression (92.6%), then the decision tree (87.1%) and KNN (80.7%).
The random forest's edge over multinomial logistic regression is small, so the gain from a more complex non-linear model is modest on this data. Feature importance showed that physical factors (weight, height and age) are the primary predictors of obesity level.
Behavioral factors such as water intake and vegetable consumption also ranked high for predicting obesity level.
The lowest-ranking features were means of transport and screen time in the decision tree, and smoking and calorie monitoring in the random forest, which suggests these variables add little to predicting obesity level.
Physical activity ranked in the middle of the random forest importance ranking; a plausible reason for its contribution is its influence on calorie burning.
Number of main meals per day was also a useful predictor, plausibly because more meals per day translate to higher calorie intake.
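Because the test set holds only 420 observations, it helps to put an uncertainty band around each accuracy before ranking the models. The sketch below applies base R's `binom.test` to the number of correct predictions per model (the diagonal sums of the four test confusion matrices). Treating test cases as independent Bernoulli trials is an assumption of this sketch.

```r
n_test  <- 420   # size of the test set (total of each confusion matrix)
correct <- c(multinomial = 389, decision_tree = 366,
             random_forest = 391, knn = 339)   # diagonal sums of the four matrices

accuracy <- correct / n_test

# Exact (Clopper-Pearson) 95% interval for each accuracy
ci <- t(sapply(correct, function(k) binom.test(k, n_test)$conf.int))
colnames(ci) <- c("lower", "upper")

round(cbind(accuracy, ci), 3)
```

Where two intervals overlap, as they do for the random forest and the multinomial model, the ranking between those models should be treated as tentative.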

Public Health Application

The high accuracy of the Random Forest model suggests it could be implemented in clinical or public health screening tools to rapidly and reliably classify an individual’s obesity risk based on easily collected data. This early, accurate classification can facilitate timely, targeted health interventions.

Limitations

This study is limited to a single data set. The models should be validated on geographically diverse data to confirm their robustness in predicting obesity level, and the stability of the feature importance rankings, across different populations.

Future work

Further tuning of the random forest model
Regression task: derive a BMI feature and predict it as the response instead of the obesity level factor.
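For the regression idea, BMI can be derived directly from the existing columns as weight in kilograms divided by squared height in meters. The sketch below uses a few height and weight values from the training data; `sample_people` is an illustrative name.

```r
# Height (m) and weight (kg) for five individuals from the training data
sample_people <- data.frame(height = c(1.62, 1.52, 1.80, 1.80, 1.78),
                            weight = c(64, 56, 77, 87, 89.8))

# BMI = weight in kilograms divided by squared height in meters
sample_people$bmi <- sample_people$weight / sample_people$height^2

round(sample_people$bmi, 2)
```

Since BMI is an exact function of weight and height, a regression on BMI is only informative if those two columns are left out of the predictors.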