1 Introduction

Obesity is a complex global health challenge with roots extending beyond individual lifestyle choices. Despite the modern trend regarding health consciousness which encourage people to embrace healthier habits, including better diet and doing more exercise, obesity remains a prevalent issue.

Based on the dataset that estimate obesity levels in individual from Mexico, Peru and Colombia in 2019, this project aims to uncover hidden patterns causing obesity in these three nations by using association rule mining.

2 Data Processing

For more information regarding this dataset, 77% of the data was generated synthetically using the Weka tool and the SMOTE filter, and 23% of the data was collected directly from users through a web platform by 5 researchers in Colombia (De-La-Hoz-Correa et al., 2019).

In this dataset, there is 1 target which is the level of obesity and 16 features including gender, age, height, weight, family history with over weight, and these questions:
Do you eat high caloric food frequently?
Do you usually eat vegetables in your meals?
How many main meals do you have daily?
Do you eat any food between meals?
Do you smoke?
How much water do you drink daily?
Do you monitor the calories you eat daily?
How often do you have physical activity (per week)?
How much time do you use technological devices such as cell phone, video games, television, computer and others daily?
How often do you drink alcohol?
Which transportation do you usually use?

During the first step of data processing, I will explore the data, checking the structure and remove duplicate observations.

df <- read.csv("obesity.csv", sep = ",")
str(df)

## 'data.frame':    2111 obs. of  17 variables:
##  $ Gender                        : chr  "Female" "Female" "Male" "Male" ...
##  $ Age                           : num  21 21 23 27 22 29 23 22 24 22 ...
##  $ Height                        : num  1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 ...
##  $ Weight                        : num  64 56 77 87 89.8 53 55 53 64 68 ...
##  $ family_history_with_overweight: chr  "yes" "yes" "yes" "no" ...
##  $ high.caloric.food.frequently  : chr  "no" "no" "no" "no" ...
##  $ vegetables.in.meals           : num  2 3 2 3 2 2 3 2 3 2 ...
##  $ number.of.main.meals          : num  3 3 3 3 1 3 3 3 3 3 ...
##  $ food.between.meals            : chr  "Sometimes" "Sometimes" "Sometimes" "Sometimes" ...
##  $ smoke                         : chr  "no" "yes" "no" "no" ...
##  $ water.daily                   : num  2 3 2 2 2 2 2 2 2 2 ...
##  $ monitor.calories              : chr  "no" "yes" "no" "no" ...
##  $ physical.activity             : num  0 3 2 2 0 0 1 3 1 1 ...
##  $ tech.devices.usage            : num  1 0 1 0 0 0 0 0 1 1 ...
##  $ alcohol.frequency             : chr  "no" "Sometimes" "Frequently" "Frequently" ...
##  $ transportation.type           : chr  "Public_Transportation" "Public_Transportation" "Public_Transportation" "Walking" ...
##  $ NObeyesdad                    : chr  "Normal_Weight" "Normal_Weight" "Normal_Weight" "Overweight_Level_I" ...

head(df)

##   Gender Age Height Weight family_history_with_overweight
## 1 Female  21   1.62   64.0                            yes
## 2 Female  21   1.52   56.0                            yes
## 3   Male  23   1.80   77.0                            yes
## 4   Male  27   1.80   87.0                             no
## 5   Male  22   1.78   89.8                             no
## 6   Male  29   1.62   53.0                             no
##   high.caloric.food.frequently vegetables.in.meals number.of.main.meals
## 1                           no                   2                    3
## 2                           no                   3                    3
## 3                           no                   2                    3
## 4                           no                   3                    3
## 5                           no                   2                    1
## 6                          yes                   2                    3
##   food.between.meals smoke water.daily monitor.calories physical.activity
## 1          Sometimes    no           2               no                 0
## 2          Sometimes   yes           3              yes                 3
## 3          Sometimes    no           2               no                 2
## 4          Sometimes    no           2               no                 2
## 5          Sometimes    no           2               no                 0
## 6          Sometimes    no           2               no                 0
##   tech.devices.usage alcohol.frequency   transportation.type
## 1                  1                no Public_Transportation
## 2                  0         Sometimes Public_Transportation
## 3                  1        Frequently Public_Transportation
## 4                  0        Frequently               Walking
## 5                  0         Sometimes Public_Transportation
## 6                  0         Sometimes            Automobile
##            NObeyesdad
## 1       Normal_Weight
## 2       Normal_Weight
## 3       Normal_Weight
## 4  Overweight_Level_I
## 5 Overweight_Level_II
## 6       Normal_Weight

df <- df[!duplicated(df), ]
table(df$Gender)

## 
## Female   Male 
##   1035   1052

I aim to ensure an equal distribution of genders in the dataset to maintain the reliability of the results. With 1,035 females and 1,052 males, the distribution is balanced enough to proceed with the analysis.

Additionally, in order to apply the association rules to find pattern in obesity, I first need to transform the data from raw dataset. Character data will be recognize as factor, and each numerical data will be divide into meaningful categories.

2.1 Numerical variables

Physical attributes such as height and weight play a crucial role in determining obesity levels, with the criteria for assessing obesity often varying by age and gender. To create a more intuitive and actionable feature for further research, I will calculate the Body Mass Index (BMI) for each individual.

BMI is a standardized metric widely recommended in population health studies and clinical settings. It provides a simple yet effective measure for evaluating an individual’s weight status relative to their height.

The formula for BMI is as follows: \[ \text{BMI} = \frac{\text{Weight (kg)}}{\text{Height (m)}^2} \] This calculation will support to classify individuals into meaningful categories and support a deeper analysis of obesity-related patterns.

df$BMI <- df$Weight / (df$Height^2)

In the next step, I will divide each numerical variables to 4 different categories based on the distribution of data, and recommendation of BMI categories from the U.S. Centers for Disease Control.

df$bmi_grp <- ifelse(df[, 18] < 18.5, "low BMI", ifelse(df[, 18] < 25, "normal BMI", ifelse(df[, 18] < 30, "high BMI", "very high BMI")))

df$age_grp <- ifelse(df[, 2] < 20, "under 20", ifelse(df[, 2] < 23, "under 23", ifelse(df[, 2] < 27, "23 to 27", "older than 27")))

df$main_meals <- ifelse(df[, 8] < 2, "under 2 meals", ifelse(df[, 8] < 3, "2 to 3 meals", ifelse(df[, 8] == 3, "3 meals per day", "more than 3 meals")))

df$h20_group <- ifelse(df[, 11] < 1.5, "less than 1.5 litres", ifelse(df[,11] < 2, "1.5 - 2 litres", ifelse(df[,11] < 2.5, "2 - 2.5 litres", "more than 2.5 litres")))

df$activity_level <- ifelse(df[, 13] == 0, "do not excersize", ifelse(df[, 13] < 1, "minimal excersize", ifelse(df[, 13] < 2, "moderate excersize", "active")))

df$tech_use <- ifelse(df[, 14] < 0.5, "barely use tech device", ifelse(df[, 14] < 1, "less than an hour", ifelse(df[, 14] < 2, "1 to 2 hours", "more than 2 hours")))

2.2 Character variables

During the survey, there are multiple binary questions, which would cause duplicate values when applying the association rules, therefore, those answers will be also rename.

Within the dataset, the authors classified 2 types of overweight and 3 types of obesity. However, during this project, I aimed to find the factors that lead to obesity and overweight in general, therefore, I will group all obesity types into 1 and overweight into 1 category.

df <- df %>% mutate(
      weight_type = case_when(
      NObeyesdad %in% c("Normal_Weight", "Insufficient_Weight") ~ NObeyesdad,
      NObeyesdad %in% c("Obesity_Type_I", "Obesity_Type_II", "Obesity_Type_III") ~ "Obesity",
      NObeyesdad %in% c("Overweight_Level_I", "Overweight_Level_II") ~ "Overweight",
      TRUE ~ NObeyesdad))

I will then limit the data by choosing appropriate columns for further analysis, get rid of the variable regarding number of vegetables in meals. Checking the dataset before apply association rule.

transdf <- df[, c(1, 16, 19:31)]
head(transdf)

##   Gender   transportation.type    bmi_grp       age_grp      main_meals
## 1 Female Public_Transportation normal BMI      under 23 3 meals per day
## 2 Female Public_Transportation normal BMI      under 23 3 meals per day
## 3   Male Public_Transportation normal BMI      23 to 27 3 meals per day
## 4   Male               Walking   high BMI older than 27 3 meals per day
## 5   Male Public_Transportation   high BMI      under 23   under 2 meals
## 6   Male            Automobile normal BMI older than 27 3 meals per day
##              h20_group   activity_level               tech_use    fam_hist
## 1       2 - 2.5 litres do not excersize           1 to 2 hours has history
## 2 more than 2.5 litres           active barely use tech device has history
## 3       2 - 2.5 litres           active           1 to 2 hours has history
## 4       2 - 2.5 litres           active barely use tech device  no history
## 5       2 - 2.5 litres do not excersize barely use tech device  no history
## 6       2 - 2.5 litres do not excersize barely use tech device  no history
##             high_cal  is_smoke     cal_monitor             alc_freq
## 1  no high kcal food not smoke no monitor kcal     do not drink alc
## 2  no high kcal food     smoke    monitor kcal  sometimes drink alc
## 3  no high kcal food not smoke no monitor kcal drink alc frequently
## 4  no high kcal food not smoke no monitor kcal drink alc frequently
## 5  no high kcal food not smoke no monitor kcal  sometimes drink alc
## 6 eat high kcal food not smoke no monitor kcal  sometimes drink alc
##                   food_btmeal   weight_type
## 1 sometimes eat between meals Normal_Weight
## 2 sometimes eat between meals Normal_Weight
## 3 sometimes eat between meals Normal_Weight
## 4 sometimes eat between meals    Overweight
## 5 sometimes eat between meals    Overweight
## 6 sometimes eat between meals Normal_Weight

write.csv(transdf, file="obese_analysis.csv", row.names = FALSE)

csv file will be imported by using the read.transactions() function to create a proper format for further analysis

transactions <- read.transactions("obese_analysis.csv", format="basket", sep=",", skip=1)
inspect(transactions[1:5])

##     items                         
## [1] {1 to 2 hours,                
##      2 - 2.5 litres,              
##      3 meals per day,             
##      do not drink alc,            
##      do not excersize,            
##      Female,                      
##      has history,                 
##      no high kcal food,           
##      no monitor kcal,             
##      normal BMI,                  
##      Normal_Weight,               
##      not smoke,                   
##      Public_Transportation,       
##      sometimes eat between meals, 
##      under 23}                    
## [2] {3 meals per day,             
##      active,                      
##      barely use tech device,      
##      Female,                      
##      has history,                 
##      monitor kcal,                
##      more than 2.5 litres,        
##      no high kcal food,           
##      normal BMI,                  
##      Normal_Weight,               
##      Public_Transportation,       
##      smoke,                       
##      sometimes drink alc,         
##      sometimes eat between meals, 
##      under 23}                    
## [3] {1 to 2 hours,                
##      2 - 2.5 litres,              
##      23 to 27,                    
##      3 meals per day,             
##      active,                      
##      drink alc frequently,        
##      has history,                 
##      Male,                        
##      no high kcal food,           
##      no monitor kcal,             
##      normal BMI,                  
##      Normal_Weight,               
##      not smoke,                   
##      Public_Transportation,       
##      sometimes eat between meals} 
## [4] {2 - 2.5 litres,              
##      3 meals per day,             
##      active,                      
##      barely use tech device,      
##      drink alc frequently,        
##      high BMI,                    
##      Male,                        
##      no high kcal food,           
##      no history,                  
##      no monitor kcal,             
##      not smoke,                   
##      older than 27,               
##      Overweight,                  
##      sometimes eat between meals, 
##      Walking}                     
## [5] {2 - 2.5 litres,              
##      barely use tech device,      
##      do not excersize,            
##      high BMI,                    
##      Male,                        
##      no high kcal food,           
##      no history,                  
##      no monitor kcal,             
##      not smoke,                   
##      Overweight,                  
##      Public_Transportation,       
##      sometimes drink alc,         
##      sometimes eat between meals, 
##      under 2 meals,               
##      under 23}

length(transactions)

## [1] 2087

3 Factors frequency

How is all the item varies within this dataset, checking the item frequency to recognize the items that appear the most.

sorted_freq <- sort(itemFrequency(transactions, type="relative"))
sorted_freq

##             always drink alc                         Bike 
##                 0.0004791567                 0.0033540968 
##                    Motorbike        no food between meals 
##                 0.0052707235                 0.0177287973 
##                        smoke     always eat between meals 
##                 0.0210828941                 0.0253953043 
##                      Walking         drink alc frequently 
##                 0.0263536176                 0.0335409679 
##                 monitor kcal            more than 2 hours 
##                 0.0459990417                 0.0517489219 
##            more than 3 meals frequently eat between meals 
##                 0.1092477240                 0.1130809775 
##            no high kcal food                      low BMI 
##                 0.1164350743                 0.1274556780 
##          Insufficient_Weight                Normal_Weight 
##                 0.1279348347                 0.1351221850 
##               1.5 - 2 litres                 2 to 3 meals 
##                 0.1360804983                 0.1365596550 
##                   normal BMI                   no history 
##                 0.1413512218                 0.1748921897 
##                       active                under 2 meals 
##                 0.1796837566                 0.1796837566 
##             do not excersize                   Automobile 
##                 0.1940584571                 0.2184954480 
##                older than 27            less than an hour 
##                 0.2204120747                 0.2218495448 
##         less than 1.5 litres         more than 2.5 litres 
##                 0.2290368951                 0.2405366555 
##                     under 20                     23 to 27 
##                 0.2558696694                 0.2606612362 
##                     under 23                     high BMI 
##                 0.2630570196                 0.2644944897 
##                   Overweight                 1 to 2 hours 
##                 0.2712026833                 0.2798275036 
##            minimal excersize             do not drink alc 
##                 0.2874940105                 0.3047436512 
##           moderate excersize               2 - 2.5 litres 
##                 0.3387637758                 0.3943459511 
##       barely use tech device                      Obesity 
##                 0.4465740297                 0.4657402971 
##                very high BMI                       Female 
##                 0.4666986104                 0.4959271682 
##                         Male              3 meals per day 
##                 0.5040728318                 0.5745088644 
##          sometimes drink alc        Public_Transportation 
##                 0.6612362242                 0.7465261140 
##                  has history  sometimes eat between meals 
##                 0.8251078103                 0.8437949209 
##           eat high kcal food              no monitor kcal 
##                 0.8835649257                 0.9540009583 
##                    not smoke 
##                 0.9789171059

par(mar = c(10, 4, 4, 2))
barplot(sorted_freq, main = "Item Frequency", col = "skyblue", las = 2)

In general, there are 12.79% people that has insufficient weight, 13.5% is in normal weight range. Clearly, the number of people who are expose to overweight and obesity is significantly higher than those with normal and low weight since 27% are overweight and 46.57% has obesity.

4 Association rules

Now, I would like to check the association rules in of each type of weight, starting with a reasonable level of support, then I will subset the rules to find the factors of obesity.

rules <- apriori(transactions, parameter = list(supp = 0.1, conf = 0.8, maxlen = 5),
                         appearance = list(default = "lhs", rhs = c("Obesity", "Overweight", "Normal_Weight", "Insufficient_Weight")), control=list(verbose=F))
summary(rules)

## set of 760 rules
## 
## rule length distribution (lhs + rhs):sizes
##   2   3   4   5 
##   4  40 190 526 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   4.000   5.000   4.629   5.000   5.000 
## 
## summary of quality measures:
##     support         confidence        coverage           lift      
##  Min.   :0.1001   Min.   :0.8007   Min.   :0.1001   Min.   :1.719  
##  1st Qu.:0.1131   1st Qu.:0.9929   1st Qu.:0.1188   1st Qu.:2.135  
##  Median :0.1385   Median :0.9970   Median :0.1457   Median :2.147  
##  Mean   :0.1673   Mean   :0.9728   Mean   :0.1714   Mean   :2.394  
##  3rd Qu.:0.1774   3rd Qu.:1.0000   3rd Qu.:0.1827   3rd Qu.:2.147  
##  Max.   :0.4648   Max.   :1.0000   Max.   :0.4667   Max.   :7.744  
##      count      
##  Min.   :209.0  
##  1st Qu.:236.0  
##  Median :289.0  
##  Mean   :349.2  
##  3rd Qu.:370.2  
##  Max.   :970.0  
## 
## mining info:
##          data ntransactions support confidence
##  transactions          2087     0.1        0.8
##                                                                                                                                                                                                                           call
##  apriori(data = transactions, parameter = list(supp = 0.1, conf = 0.8, maxlen = 5), appearance = list(default = "lhs", rhs = c("Obesity", "Overweight", "Normal_Weight", "Insufficient_Weight")), control = list(verbose = F))

rules_sorted <- sort(rules, by = "lift", decreasing = TRUE)
inspect(head(rules_sorted))

##     lhs                      rhs                     support confidence  coverage     lift count
## [1] {eat high kcal food,                                                                        
##      low BMI}             => {Insufficient_Weight} 0.1020604  0.9906977 0.1030187 7.743768   213
## [2] {eat high kcal food,                                                                        
##      low BMI,                                                                                   
##      not smoke}           => {Insufficient_Weight} 0.1020604  0.9906977 0.1030187 7.743768   213
## [3] {low BMI}             => {Insufficient_Weight} 0.1260182  0.9887218 0.1274557 7.728324   263
## [4] {low BMI,                                                                                   
##      not smoke}           => {Insufficient_Weight} 0.1255391  0.9886792 0.1269765 7.727991   262
## [5] {low BMI,                                                                                   
##      no monitor kcal}     => {Insufficient_Weight} 0.1154768  0.9877049 0.1169142 7.720375   241
## [6] {low BMI,                                                                                   
##      no monitor kcal,                                                                           
##      not smoke}           => {Insufficient_Weight} 0.1149976  0.9876543 0.1164351 7.719980   240

plot(rules, method = "grouped")

From the summary, there are 760 rules were mined by using Apriori algorithm, most of the rules range from 2 to 5 items. The support range from 1% to 46.5%, with the confidence is high, coming from 0.8 to 1, suggest great predictability of the right hand side from left hand side.

However, as there are still many rules which fall into the low support range, meaning they occur only in a few transactions. Hence, I will clean the rules by remove the redundant rules.

strong_rules <- rules[!is.redundant(rules)]
strong_rules <- strong_rules[(is.maximal(strong_rules))]
summary(strong_rules)

## set of 120 rules
## 
## rule length distribution (lhs + rhs):sizes
##  3  4  5 
## 13 19 88 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   4.000   5.000   4.625   5.000   5.000 
## 
## summary of quality measures:
##     support         confidence        coverage           lift      
##  Min.   :0.1001   Min.   :0.8007   Min.   :0.1001   Min.   :1.719  
##  1st Qu.:0.1048   1st Qu.:0.8232   1st Qu.:0.1198   1st Qu.:1.768  
##  Median :0.1248   Median :0.8684   Median :0.1354   Median :1.864  
##  Mean   :0.1326   Mean   :0.8930   Mean   :0.1485   Mean   :2.070  
##  3rd Qu.:0.1443   3rd Qu.:0.9920   3rd Qu.:0.1672   3rd Qu.:2.147  
##  Max.   :0.3598   Max.   :1.0000   Max.   :0.3598   Max.   :7.744  
##      count      
##  Min.   :209.0  
##  1st Qu.:218.8  
##  Median :260.5  
##  Mean   :276.6  
##  3rd Qu.:301.2  
##  Max.   :751.0  
## 
## mining info:
##          data ntransactions support confidence
##  transactions          2087     0.1        0.8
##                                                                                                                                                                                                                           call
##  apriori(data = transactions, parameter = list(supp = 0.1, conf = 0.8, maxlen = 5), appearance = list(default = "lhs", rhs = c("Obesity", "Overweight", "Normal_Weight", "Insufficient_Weight")), control = list(verbose = F))

par(mar = c(10, 4, 4, 2) + 0.1)
plot(strong_rules, method = "grouped")

After cleaning, I now have 120 rules in total. Based on the group matrix plot, the larger the bubble, the higher support it has. Along with that, the color of bubble indicates the lift of the rules. A stronger lift means a stronger association between LHS and RHS.

4.1 Overweight

I would like to see the association rules for people with overweight.

overweight_rules <- subset(strong_rules, rhs %in% "Overweight")
inspectDT(overweight_rules)

plot(overweight_rules, method = "graph", control = list(type = "items"))

## Available control parameters (with default values):
## layout    =  stress
## circular  =  FALSE
## ggraphdots    =  NULL
## edges     =  <environment>
## nodes     =  <environment>
## nodetext  =  <environment>
## colors    =  c("#EE0000FF", "#EEEEEEFF")
## engine    =  ggplot2
## max   =  100
## verbose   =  FALSE

For better illustrate the connection of the left-hand side and right-hand side, I will plot Sankey diagram, with the width of arrow corresponds to the support level of each rule, showing how the LHS contribute to the outcome.

library(networkD3)
#convert rules into a data frame and extract the lhs of the rules 
rule_data <- as(overweight_rules, "data.frame")
rule_data$lhs <- gsub("\\{|\\}", "", rule_data$rules)
rule_data$lhs <- sapply(strsplit(rule_data$lhs, " => "), "[", 1)
rule_data$rhs <- "Overweight"  

nodes <- data.frame(name = unique(c(rule_data$lhs, rule_data$rhs)))
links <- data.frame(source = match(rule_data$lhs, nodes$name) - 1,
  target = match(rule_data$rhs, nodes$name) - 1, value = rule_data$support)

sankeyNetwork(Links = links, Nodes = nodes, Source = "source", 
              Target = "target",Value = "value", NodeID = "name", 
              units = "Support",fontSize = 12,nodeWidth = 30)

As I can see, high BMI is central and strongly connected to overweight, and it can be consider as one of the most significant contributors to define people in overweight category. Additionally, the dark red color from the graph indicates a high lift, signifying that individuals with high BMI are much more likely to be classified as overweight compared to the baseline probability.

On the other hand, males are more likely to fall into the overweight category than females, and people who are overweight tend to eat 3 meals per day.

Other factor to consider is alcohol consumption, even though it has a lower support, indicates a lesser influence overall. However, it still plays a role as part of the rules combination contributing to being overweight.

4.2 Obesity

The association rules for individuals with obesity will be checked.

obese_rules <- subset(strong_rules, rhs %in% "Obesity")
inspectDT(obese_rules)

There are total of 113 rules for obesity, to better look into the relation of it, I will visualize them on the plots.

plot(obese_rules, method = "graph", control = list(type = "items"), main = "Obesity rules ")

## Warning: Unknown control parameters: type, main

## Available control parameters (with default values):
## layout    =  stress
## circular  =  FALSE
## ggraphdots    =  NULL
## edges     =  <environment>
## nodes     =  <environment>
## nodetext  =  <environment>
## colors    =  c("#EE0000FF", "#EEEEEEFF")
## engine    =  ggplot2
## max   =  100
## verbose   =  FALSE

## Warning: Too many rules supplied. Only plotting the best 100 using 'lift'
## (change control parameter max if needed).

plot(obese_rules, method = "paracoord", control = list(type = "items"), main = "Parallel coordinates plot for obesity rules")

## Warning: Unknown control parameters: type

## Available control parameters (with default values):
## main  =  Parallel coordinates plot for 113 rules
## reorder   =  TRUE
## interactive   =  FALSE
## engine    =  default
## gp_labels     =  list()
## newpage   =  TRUE
## col   =  c("#EE0000FF", "#EE0303FF", "#EE0606FF", "#EE0909FF", "#EE0C0CFF", "#EE0F0FFF", "#EE1212FF", "#EE1515FF", "#EE1818FF", "#EE1B1BFF", "#EE1E1EFF", "#EE2222FF", "#EE2525FF", "#EE2828FF", "#EE2B2BFF", "#EE2E2EFF", "#EE3131FF", "#EE3434FF", "#EE3737FF", "#EE3A3AFF", "#EE3D3DFF", "#EE4040FF", "#EE4444FF", "#EE4747FF", "#EE4A4AFF", "#EE4D4DFF", "#EE5050FF", "#EE5353FF", "#EE5656FF", "#EE5959FF", "#EE5C5CFF", "#EE5F5FFF", "#EE6262FF", "#EE6666FF", "#EE6969FF", "#EE6C6CFF", "#EE6F6FFF", "#EE7272FF", "#EE7575FF",  "#EE7878FF", "#EE7B7BFF", "#EE7E7EFF", "#EE8181FF", "#EE8484FF", "#EE8888FF", "#EE8B8BFF", "#EE8E8EFF", "#EE9191FF", "#EE9494FF", "#EE9797FF", "#EE9999FF", "#EE9B9BFF", "#EE9D9DFF", "#EE9F9FFF", "#EEA0A0FF", "#EEA2A2FF", "#EEA4A4FF", "#EEA5A5FF", "#EEA7A7FF", "#EEA9A9FF", "#EEABABFF", "#EEACACFF", "#EEAEAEFF", "#EEB0B0FF", "#EEB1B1FF", "#EEB3B3FF", "#EEB5B5FF", "#EEB7B7FF", "#EEB8B8FF", "#EEBABAFF", "#EEBCBCFF", "#EEBDBDFF", "#EEBFBFFF", "#EEC1C1FF", "#EEC3C3FF", "#EEC4C4FF", "#EEC6C6FF", "#EEC8C8FF",  "#EEC9C9FF", "#EECBCBFF", "#EECDCDFF", "#EECFCFFF", "#EED0D0FF", "#EED2D2FF", "#EED4D4FF", "#EED5D5FF", "#EED7D7FF", "#EED9D9FF", "#EEDBDBFF", "#EEDCDCFF", "#EEDEDEFF", "#EEE0E0FF", "#EEE1E1FF", "#EEE3E3FF", "#EEE5E5FF", "#EEE7E7FF", "#EEE8E8FF", "#EEEAEAFF", "#EEECECFF", "#EEEEEEFF")
## alpha     =  NULL
## quality   =  2
## verbose   =  FALSE

As illustrated from the plots and with the level of confidence equal 100%, the prominent factor connect to obesity is very high BMI value, which in fact, BMI is a valid indicator to define obesity.

The other lifestyle and demographic factors that interact with “Very High BMI” are dietary habits, activity levels. With the second highest support of 34%, people who consume high-calorie foods, and sometimes drink alcohol strongly associate with obesity. Besides that, people that use public transportation, eat high calories food, along with very high BMI also appear to have obesity (support: 36% and lift: 2.147)

Individuals aged above 23 frequently appear in the rules, suggests a shift in weight-related behaviors or metabolism in these age groups.

For more visualization of the result, I have also plot the word cloud showing association rules of obesity.

#extract lhs items as string
lhs <- gsub("[{}]", "", labels(lhs(obese_rules)))
#split them into individual items for plotting
split_lhs<- unlist(strsplit(lhs, ","))
lhsfreq <- table(split_lhs)
set.seed(42)
wordcloud(words = names(lhsfreq), freq = as.numeric(lhsfreq),   
  min.freq = 2, max.words = 100, random.order = FALSE, colors = brewer.pal(8, "Dark2"), scale = c(2, 0.5))

Also, it is worth mentioning that individuals with a family history of obesity are more likely to develop obesity themselves. From the word cloud plot, I can notice that female appears more in the rules associated with obesity, which contrasts with the rules for overweight, where males are more prominent.

Interestingly, the use of technological devices shows minimal interaction with obesity, which is surprising for me because I thought individuals with obesity would use technical devices such as TV, or mobile phones frequently.

5 Conclusion:

To manage physical health effectively, monitoring calories intake is a crucial factor. However, this aspect is often overlooked among individuals in the overweight and obesity groups.

Overweight is a precursor into obesity, hence, when people are getting signal of this, they should pay attention to adjust their lifestyle properly. Individuals should get encourage to frequently monitor their weight and BMI to identify early signs of overweight and obesity.

For businesses, publishing the information regarding calories per dish could better improve the awareness of calories intake, which is crucial for those with the risk of obesity.

For the government, it will be great to implement health promotion programs that provide incentives for healthy behaviors by collaborate with schools, workplaces, and communities, encourage people to stay active, consume less alcohol, and potentially, provide nutritional counseling and fitness programs aimed at breaking the cycle of obesity within families.

6 References:

The Data set came from UC Irvine Machine Learning Repository, which can be found here Eduardo De-La-Hoz-Correa, Fabio E. Mendoza-Palechor, Alexis De-La-Hoz-Manotas, Roberto C. Morales-Ortegaand Sánchez Hernández Beatriz Adriana (2019) Obesity Level Estimation Software based on Decision Trees

Body Mass Index categories from the U.S. Centers for Disease Control and Prevention, which can be found here

Marc-André Cornier, MD (2022) A Review of Current Guidelines for the Treatment of Obesity

Finding Factors of Obesity using Apriori Association Rule Mining

Chi Dao