The Dataset and Initial EDA

Author

Onesmus Kabui

About the Dataset

This dataset contains data from individuals in Mexico, Peru and Colombia on their eating habits and physical condition. From this data we will estimate their obesity levels. Each record has 17 attributes and there are 2111 records in total. 23% of the data was collected directly from users through a web platform and the rest was generated synthetically. The target variable, NObeyesdad, classifies each record into one of seven categories: Insufficient Weight, Normal Weight, Overweight Level I, Overweight Level II, Obesity Type I, Obesity Type II or Obesity Type III.

Variables Overview

This dataset has 17 variables: 16 features and one target variable. The variables can be grouped into the following categories.

  1. Demographic Information
  • Gender – Biological sex of the individual

  • Age – Age in years

  • Height – Height in meters

  • Weight – Weight in kilograms

  2. Eating Habits

  • family_history_with_overweight – Whether close family members are overweight

  • FAVC – Frequent consumption of high-calorie food (yes/no)

  • FCVC – Frequency of vegetable consumption (scale)

  • NCP – Number of main meals per day

  • CAEC – Consumption of food between meals (snacking)

  • SMOKE – Whether the individual smokes

  • CH2O – Daily water intake (liters)

  • SCC – Monitoring of daily calorie consumption (yes/no)

  • CALC – Frequency of alcohol consumption

  3. Physical Condition & Lifestyle

  • FAF – Physical activity frequency (hours per week)

  • TUE – Time spent using technology devices (hours per day)

  • MTRANS – Mode of transportation used (car, bike, public transport, walking, etc.)

  4. Target Variable

  • NObeyesdad – Obesity level (Insufficient Weight, Normal Weight, Overweight Level I, Overweight Level II, Obesity Type I, Obesity Type II, Obesity Type III)

Initial EDA

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# ggplot2 is already attached by library(tidyverse)
setwd("C:/Users/user/Desktop/datasets") # adjust to the folder that holds the CSV file
obesity_data <- read.csv("ObesityDataSet_raw.csv")
glimpse(obesity_data)
Rows: 2,111
Columns: 17
$ Gender                         <chr> "Female", "Female", "Male", "Male", "Ma…
$ Age                            <dbl> 21, 21, 23, 27, 22, 29, 23, 22, 24, 22,…
$ Height                         <dbl> 1.62, 1.52, 1.80, 1.80, 1.78, 1.62, 1.5…
$ Weight                         <dbl> 64.0, 56.0, 77.0, 87.0, 89.8, 53.0, 55.…
$ family_history_with_overweight <chr> "yes", "yes", "yes", "no", "no", "no", …
$ FAVC                           <chr> "no", "no", "no", "no", "no", "yes", "y…
$ FCVC                           <dbl> 2, 3, 2, 3, 2, 2, 3, 2, 3, 2, 3, 2, 3, …
$ NCP                            <dbl> 3, 3, 3, 3, 1, 3, 3, 3, 3, 3, 3, 3, 3, …
$ CAEC                           <chr> "Sometimes", "Sometimes", "Sometimes", …
$ SMOKE                          <chr> "no", "yes", "no", "no", "no", "no", "n…
$ CH2O                           <dbl> 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 3, …
$ SCC                            <chr> "no", "yes", "no", "no", "no", "no", "n…
$ FAF                            <dbl> 0, 3, 2, 2, 0, 0, 1, 3, 1, 1, 2, 2, 2, …
$ TUE                            <dbl> 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 2, 1, 0, …
$ CALC                           <chr> "no", "Sometimes", "Frequently", "Frequ…
$ MTRANS                         <chr> "Public_Transportation", "Public_Transp…
$ NObeyesdad                     <chr> "Normal_Weight", "Normal_Weight", "Norm…
summary(obesity_data)
    Gender               Age            Height          Weight      
 Length:2111        Min.   :14.00   Min.   :1.450   Min.   : 39.00  
 Class :character   1st Qu.:19.95   1st Qu.:1.630   1st Qu.: 65.47  
 Mode  :character   Median :22.78   Median :1.700   Median : 83.00  
                    Mean   :24.31   Mean   :1.702   Mean   : 86.59  
                    3rd Qu.:26.00   3rd Qu.:1.768   3rd Qu.:107.43  
                    Max.   :61.00   Max.   :1.980   Max.   :173.00  
 family_history_with_overweight     FAVC                FCVC      
 Length:2111                    Length:2111        Min.   :1.000  
 Class :character               Class :character   1st Qu.:2.000  
 Mode  :character               Mode  :character   Median :2.386  
                                                   Mean   :2.419  
                                                   3rd Qu.:3.000  
                                                   Max.   :3.000  
      NCP            CAEC              SMOKE                CH2O      
 Min.   :1.000   Length:2111        Length:2111        Min.   :1.000  
 1st Qu.:2.659   Class :character   Class :character   1st Qu.:1.585  
 Median :3.000   Mode  :character   Mode  :character   Median :2.000  
 Mean   :2.686                                         Mean   :2.008  
 3rd Qu.:3.000                                         3rd Qu.:2.477  
 Max.   :4.000                                         Max.   :3.000  
     SCC                 FAF              TUE             CALC          
 Length:2111        Min.   :0.0000   Min.   :0.0000   Length:2111       
 Class :character   1st Qu.:0.1245   1st Qu.:0.0000   Class :character  
 Mode  :character   Median :1.0000   Median :0.6253   Mode  :character  
                    Mean   :1.0103   Mean   :0.6579                     
                    3rd Qu.:1.6667   3rd Qu.:1.0000                     
                    Max.   :3.0000   Max.   :2.0000                     
    MTRANS           NObeyesdad       
 Length:2111        Length:2111       
 Class :character   Class :character  
 Mode  :character   Mode  :character  
                                      
                                      
                                      
colSums(is.na(obesity_data))
                        Gender                            Age 
                             0                              0 
                        Height                         Weight 
                             0                              0 
family_history_with_overweight                           FAVC 
                             0                              0 
                          FCVC                            NCP 
                             0                              0 
                          CAEC                          SMOKE 
                             0                              0 
                          CH2O                            SCC 
                             0                              0 
                           FAF                            TUE 
                             0                              0 
                          CALC                         MTRANS 
                             0                              0 
                    NObeyesdad 
                             0 
# Understanding the balance of the target variable levels
obesity_data %>% 
  count(NObeyesdad) %>% 
  ggplot(aes(x = NObeyesdad, y = n, fill = NObeyesdad)) +
  geom_col() +
  labs(title = "obesity levels balance", x = "obesity level", y = "count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) # rotate x-axis labels so they do not overlap

In the initial EDA we made some key observations about the obesity data. Using glimpse() we found that the columns are either double or character. The character columns hold a small set of repeated levels, so they need to be converted to the factor data type. The numeric variables look consistent, and the summary statistics confirm this, with all values falling within a reasonable range.

There are no missing values in any variable. We also looked at the target variable in more depth and plotted it to check that the classes are balanced: if one class dominates, a model can score well by always predicting that class while failing on the smaller classes. From the plot, Obesity_Type_I has the highest count while Insufficient_Weight has the lowest. The differences between the class counts are small, so we can conclude that the target variable is reasonably balanced.
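
One way to put a number on this balance is to compare the class shares and the ratio of the largest class to the smallest. The sketch below is illustrative only: it hard-codes the seven class counts from the frequency table of the target variable instead of reading the CSV, and the names class_counts, class_share and imbalance_ratio are assumptions introduced for the example.

```r
# Class counts copied from the frequency table of the target variable
class_counts <- c(
  Insufficient_Weight = 272,
  Normal_Weight       = 287,
  Obesity_Type_I      = 351,
  Obesity_Type_II     = 297,
  Obesity_Type_III    = 324,
  Overweight_Level_I  = 290,
  Overweight_Level_II = 290
)

# Share of each class in the full data set
class_share <- round(class_counts / sum(class_counts), 3)

# Ratio of the largest class to the smallest; 1 would mean perfectly balanced
imbalance_ratio <- max(class_counts) / min(class_counts)

print(class_share)
print(round(imbalance_ratio, 2))
```

With these counts the largest class is about 29% bigger than the smallest, which supports treating the target as reasonably balanced.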

Data Cleaning and Preparation

Renaming Columns

obesity_data <- obesity_data %>%  
  rename(
    gender = Gender, age = Age, height = Height, weight = Weight,
    high_cal_food = FAVC, vegetable_consumption = FCVC, smoke = SMOKE,
    main_meals_number = NCP, snacking_between_meals = CAEC, water_intake = CH2O,
    calories_monitoring = SCC, alcohol_frequency = CALC, physical_activity = FAF,
    screen_time = TUE, transport_means = MTRANS, obesity_level = NObeyesdad
  )

glimpse(obesity_data, width = 2) # a narrow width hides the values, leaving names and types
Rows: 2,111
Columns: 17
$ gender                         <chr> …
$ age                            <dbl> …
$ height                         <dbl> …
$ weight                         <dbl> …
$ family_history_with_overweight <chr> …
$ high_cal_food                  <chr> …
$ vegetable_consumption          <dbl> …
$ main_meals_number              <dbl> …
$ snacking_between_meals         <chr> …
$ smoke                          <chr> …
$ water_intake                   <dbl> …
$ calories_monitoring            <chr> …
$ physical_activity              <dbl> …
$ screen_time                    <dbl> …
$ alcohol_frequency              <chr> …
$ transport_means                <chr> …
$ obesity_level                  <chr> …

We have renamed the columns for readability and consistency, doing away with abbreviations and capital letters. The names are now descriptive and in lower case.
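
As an alternative way to organise a long rename, the mapping can be kept in a named vector and passed to rename() through all_of(). This is a minimal sketch on a made-up three-column data frame, not the code used in this analysis; toy, name_lookup and toy_renamed are hypothetical names.

```r
library(dplyr)

# Toy data frame with three of the original column names (values are made up)
toy <- data.frame(
  Gender = c("Female", "Male"),
  FAVC   = c("no", "yes"),
  NCP    = c(3, 1)
)

# Named vector: names are the new column names, values are the old ones
name_lookup <- c(
  gender            = "Gender",
  high_cal_food     = "FAVC",
  main_meals_number = "NCP"
)

# all_of() raises an error if an old name is missing, which guards against typos
toy_renamed <- toy %>% rename(all_of(name_lookup))

print(names(toy_renamed))
```

Keeping the mapping in one object also makes it easy to reuse the same names in a data dictionary or in plot labels.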

Checking data types

str(obesity_data) 
'data.frame':   2111 obs. of  17 variables:
 $ gender                        : chr  "Female" "Female" "Male" "Male" ...
 $ age                           : num  21 21 23 27 22 29 23 22 24 22 ...
 $ height                        : num  1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 ...
 $ weight                        : num  64 56 77 87 89.8 53 55 53 64 68 ...
 $ family_history_with_overweight: chr  "yes" "yes" "yes" "no" ...
 $ high_cal_food                 : chr  "no" "no" "no" "no" ...
 $ vegetable_consumption         : num  2 3 2 3 2 2 3 2 3 2 ...
 $ main_meals_number             : num  3 3 3 3 1 3 3 3 3 3 ...
 $ snacking_between_meals        : chr  "Sometimes" "Sometimes" "Sometimes" "Sometimes" ...
 $ smoke                         : chr  "no" "yes" "no" "no" ...
 $ water_intake                  : num  2 3 2 2 2 2 2 2 2 2 ...
 $ calories_monitoring           : chr  "no" "yes" "no" "no" ...
 $ physical_activity             : num  0 3 2 2 0 0 1 3 1 1 ...
 $ screen_time                   : num  1 0 1 0 0 0 0 0 1 1 ...
 $ alcohol_frequency             : chr  "no" "Sometimes" "Frequently" "Frequently" ...
 $ transport_means               : chr  "Public_Transportation" "Public_Transportation" "Public_Transportation" "Walking" ...
 $ obesity_level                 : chr  "Normal_Weight" "Normal_Weight" "Normal_Weight" "Overweight_Level_I" ...

We notice that every variable stored as character is really a categorical variable, so we can go ahead and convert all character columns to factors.

obesity_data <- obesity_data %>%   
  mutate(across(where(is.character), as.factor)) 
# purrr is already attached by library(tidyverse)
obesity_data %>%   
  select(where(is.factor)) %>%  
  map(table) # see the levels of each categorical variable and their distribution
$gender

Female   Male 
  1043   1068 

$family_history_with_overweight

  no  yes 
 385 1726 

$high_cal_food

  no  yes 
 245 1866 

$snacking_between_meals

    Always Frequently         no  Sometimes 
        53        242         51       1765 

$smoke

  no  yes 
2067   44 

$calories_monitoring

  no  yes 
2015   96 

$alcohol_frequency

    Always Frequently         no  Sometimes 
         1         70        639       1401 

$transport_means

           Automobile                  Bike             Motorbike 
                  457                     7                    11 
Public_Transportation               Walking 
                 1580                    56 

$obesity_level

Insufficient_Weight       Normal_Weight      Obesity_Type_I     Obesity_Type_II 
                272                 287                 351                 297 
   Obesity_Type_III  Overweight_Level_I Overweight_Level_II 
                324                 290                 290 

After converting to factors, the tables above show each categorical column with its levels and their counts. At this point the data is consistent and readable, so we confirm once more that there are no missing values and then proceed to EDA.
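
The frequency tables list the levels in sorted label order rather than by meaning, so "no" does not come first for the frequency-type variables. If a meaningful order is wanted for plots or tables, the levels can be spelled out explicitly. The sketch below uses a made-up vector with the four labels seen in snacking_between_meals. The ordering from least to most frequent is an assumption about how the labels should be read, and it is an optional step that is not part of the analysis above.

```r
# Toy vector using the four labels seen in snacking_between_meals
snacking <- c("Sometimes", "no", "Always", "Frequently", "Sometimes")

# as.factor() orders levels by sorting the labels, not by their meaning
default_levels <- levels(as.factor(snacking))

# Spell out the levels from least to most frequent and mark the factor as ordered
frequency_levels <- c("no", "Sometimes", "Frequently", "Always")
snacking_ordered <- factor(snacking, levels = frequency_levels, ordered = TRUE)

print(default_levels)
print(levels(snacking_ordered))

# Ordered factors support comparisons: is "Sometimes" below "Always"?
print(snacking_ordered[1] < snacking_ordered[3])
```

Whether to set ordered = TRUE depends on the model used later. A plain factor with explicit levels is already enough to fix the order in plots and tables.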

Check for missing values

colSums(is.na(obesity_data))
                        gender                            age 
                             0                              0 
                        height                         weight 
                             0                              0 
family_history_with_overweight                  high_cal_food 
                             0                              0 
         vegetable_consumption              main_meals_number 
                             0                              0 
        snacking_between_meals                          smoke 
                             0                              0 
                  water_intake            calories_monitoring 
                             0                              0 
             physical_activity                    screen_time 
                             0                              0 
             alcohol_frequency                transport_means 
                             0                              0 
                 obesity_level 
                             0 

There are no missing data.
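
One caveat with is.na(): read.csv() keeps an empty field in a character column as an empty string instead of NA, so blanks would not be counted by the check above. The frequency tables show no blank level, so that does not appear to be a problem here. A stricter check could look like the sketch below. It runs on made-up data, and count_missing_or_blank is a hypothetical helper written for this example.

```r
# Toy data with one NA and one blank string (values are made up)
toy <- data.frame(
  smoke = c("no", "", "yes"),
  age   = c(21, NA, 23),
  stringsAsFactors = FALSE
)

# Count entries that are NA or contain only whitespace
count_missing_or_blank <- function(column) {
  values <- as.character(column)
  sum(is.na(values) | trimws(values) == "")
}

na_only     <- colSums(is.na(toy))              # misses the blank string
na_or_blank <- sapply(toy, count_missing_or_blank)

print(na_only)
print(na_or_blank)
```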

Exploratory Data Analysis

In the EDA we start with univariate analysis of all variables, numeric and categorical. We then move to bivariate analysis to understand how each variable relates to the target variable and to the other variables, and finish with multivariate analysis to assess how multiple variables interact and what effect they have together. From the results we will draw out the insights.

Univariate Analysis

Numeric variables
obesity_num_vars<- c("age", "height", "weight", "vegetable_consumption",
              "main_meals_number", "water_intake", 
              "physical_activity", "screen_time")
# Histograms to visualize distribution
obesity_data %>%
  select(all_of(obesity_num_vars)) %>%
  # pivot longer so ggplot2 can handle many variables in one plot
  pivot_longer(cols = everything(), names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = value, fill = variable)) +
  geom_histogram(bins = 30, alpha = 0.7, color = "black") +
  # separate scale for each variable
  facet_wrap(~variable, scales = "free", ncol = 3) +
  theme_minimal() +
  labs(title = "Distribution of Numerical Variables")

From the univariate analysis we can draw some conclusions about each numeric variable.

  • Age: the histogram is right-skewed, meaning that most participants in this data are relatively young.

  • Height: approximately normally distributed, with most participants close to the average height.

  • Main meals number: an overwhelming number of participants take 3 meals a day, while a small number take 1 or 4 meals.

  • Physical activity: most participants exercise 0 to 1 hour per week, with a smaller group doing 2 to 3 hours per week.

  • Screen time: most participants report few hours on screens.

  • Vegetable consumption: the histogram is left-skewed, meaning most participants report frequent vegetable consumption.

  • Water intake: consumption is spread across 1 to 3 liters, although many respondents report 2 liters.

  • Weight: roughly bell-shaped but widely spread, from 39 to 173 kilograms. More than a quarter of participants weigh above 100 kilograms, which reflects the large number of obese participants.
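
The reading of the histograms can be cross-checked against the summary statistics by comparing the mean with the median. The sketch below copies the mean and median of three variables from the summary() output. The 2% threshold is an arbitrary assumption, and the comparison is only a rough heuristic that is less reliable for variables on a short bounded scale, so it is limited to age, height and weight.

```r
# Mean and median copied from the summary() output for three variables
numeric_summary <- data.frame(
  variable = c("age", "height", "weight"),
  mean     = c(24.31, 1.702, 86.59),
  median   = c(22.78, 1.700, 83.00)
)

# Gap between mean and median, relative to the median
numeric_summary$relative_gap <- with(numeric_summary, (mean - median) / median)

# Rough heuristic only: a mean clearly above the median hints at a right tail.
# The 0.02 cut-off is an arbitrary choice for this illustration.
numeric_summary$hint <- ifelse(
  numeric_summary$relative_gap > 0.02, "right tail",
  ifelse(numeric_summary$relative_gap < -0.02, "left tail", "roughly symmetric")
)

print(numeric_summary)
```

Under these assumptions, age and weight show a mean above the median, while height is close to symmetric.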

Categorical variables
class(obesity_data)
[1] "data.frame"
obesity_factors <- obesity_data %>%
  select(where(is.factor)) %>% 
  names()

str(obesity_factors)
 chr [1:9] "gender" "family_history_with_overweight" "high_cal_food" ...
obesity_data %>%
  select(all_of(obesity_factors)) %>%
  pivot_longer(cols = everything(),
               names_to = "variable",
               values_to = "value") %>%
  ggplot(aes(x = value, fill = value)) +
  geom_bar(show.legend = FALSE) +
  facet_wrap(~variable, scales = "free", ncol = 3) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Distribution of Categorical Variables",
       x = "Category",
       y = "Count")

From the distribution of categorical variables:

  • Alcohol frequency falls mainly into two groups, Sometimes and no, meaning most participants are light drinkers or non-drinkers.

  • Calories monitoring is heavily skewed towards no: almost all participants do not monitor their calorie intake, which is unsurprising since few people do.

  • Family history with overweight shows an overwhelming share of yes responses, meaning most participants had at least one family member who was overweight. This variable may carry bias, since it relies on the participant's own judgement.

  • Gender, which is a key variable, has a fair representation of males and females.

  • High-calorie food: most participants report frequent consumption of high-calorie food, while a minority report that they do not.

  • Obesity levels are fairly evenly distributed, with Obesity Type I being the largest class.

  • Smoke: most participants are non-smokers, so it will be interesting to see how the small number of smokers are classified across the obesity levels and whether smoking affects weight.

  • Snacking between meals: most participants say they sometimes snack between meals. Whether that affects weight remains to be seen when we test how well it predicts obesity level.

  • Transport means: almost all participants use either public transportation or an automobile, which probably involve a similar level of physical movement. A small group walks, and it will be interesting to see how that relates to obesity level.
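
To make statements such as "most" and "a small group" concrete, the counts can be turned into percentages. The sketch below hard-codes the counts for transport_means and smoke from the frequency tables shown earlier instead of using the data frame. The object names are illustrative.

```r
# Counts copied from the frequency tables shown earlier
transport_counts <- c(
  Automobile = 457, Bike = 7, Motorbike = 11,
  Public_Transportation = 1580, Walking = 56
)
smoke_counts <- c(no = 2067, yes = 44)

# Convert counts to percentages of all participants
transport_pct <- round(100 * prop.table(transport_counts), 1)
smoke_pct     <- round(100 * prop.table(smoke_counts), 1)

print(transport_pct)
print(smoke_pct)
```

On these counts, roughly three quarters of participants use public transportation, about a fifth use an automobile, and about 2% smoke.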