Introduction

Obesity is a complex condition influenced by various factors, including lifestyle, genetics, and behavior. One of the critical aspects is how these factors differ among individuals. Therefore, the aim of this work is to analyze the factors contributing to obesity using a data-driven approach. In this article, I will use association rules, an unsupervised learning technique that seeks to uncover and describe patterns and relationships between variables in the dataset.

Dataset

family_history : Has a family member suffered or suffers from overweight?
FAVC : Do you eat high caloric food frequently?
CAEC : Do you eat any food between meals?
SMOKE : Do you smoke?
SCC : Do you monitor the calories you eat daily?
CALC : How often do you drink alcohol?
MTRANS : Which transportation do you usually use?
Obesity_level (Target Column) : Obesity level
This dataset focuses on factors influencing obesity levels, including variables such as age, gender, height, weight, and lifestyle habits like diet, physical activity, and technology use. It aims to uncover patterns and relationships using association rules to better understand the contributors to obesity.

# Load necessary libraries
library(arules)
library(arulesViz)
library(dplyr)
library(ggplot2)
library(caret)
library(naniar)
# Load the dataset
obdata <- read.csv("C:/Users/Pandita/Desktop/Obesity.csv", sep = ",", dec = ".", header = TRUE)
str(obdata)
## 'data.frame':    1592 obs. of  12 variables:
##  $ Gender        : chr  "Female" "Female" "Male" "Male" ...
##  $ Age           : num  21 21 23 27 22 29 23 22 24 22 ...
##  $ Height        : num  1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 ...
##  $ Weight        : num  64 56 77 87 89.8 53 55 53 64 68 ...
##  $ family_history: chr  "yes" "yes" "yes" "no" ...
##  $ FAVC          : chr  "no" "no" "no" "no" ...
##  $ CAEC          : chr  "Sometimes" "Sometimes" "Sometimes" "Sometimes" ...
##  $ SMOKE         : chr  "no" "yes" "no" "no" ...
##  $ SCC           : chr  "no" "yes" "no" "no" ...
##  $ CALC          : chr  "no" "Sometimes" "Frequently" "Frequently" ...
##  $ MTRANS        : chr  "Public_Transportation" "Public_Transportation" "Public_Transportation" "Walking" ...
##  $ Obesity       : chr  "Normal_Weight" "Normal_Weight" "Normal_Weight" "Overweight_Level_I" ...

Check for missing values

vis_miss(obdata)

## No missing values were found in the data

# Converting variables into integer
obdata$Age <- as.integer(obdata$Age)
# Converting categorical variables into factors
obdata$Gender <- as.factor(obdata$Gender)
obdata$family_history <- as.factor(obdata$family_history)
obdata$CAEC <- as.factor(obdata$CAEC)
obdata$SCC <- as.factor(obdata$SCC)
obdata$SMOKE <- as.factor(obdata$SMOKE)
obdata$CALC <- as.factor(obdata$CALC)
obdata$MTRANS <- as.factor(obdata$MTRANS)
obdata$FAVC <- as.factor(obdata$FAVC)
obdata$Obesity <- as.factor(obdata$Obesity)

Analysis

BMI

In order to more fully assess an individual’s health, I calculated BMI (Body Mass Index) using height and weight.BMI is a commonly used metric for determining whether or not weight is within a healthy range. By calculating the BMI, I can better understand whether the individuals in my data are underweight, normal, overweight, or obese, which can support subsequent health analysis and decision-making.

# Grouping for age and BMI
obdata$BMI <- obdata$Weight / (obdata$Height^2)
obdata$BMI <- as.integer(obdata$BMI)
quantile(obdata$Age)
##   0%  25%  50%  75% 100% 
##   14   19   22   26   61
quantile(obdata$BMI)
##   0%  25%  50%  75% 100% 
##   12   21   27   32   49
# Grouping of BMI by quartiles
obdata$BMIGroup <- cut(obdata$BMI, 
                       breaks = c(0, 18.5, 24, 28, 36, Inf), 
                       labels = c("Underweight", "Normal", "Overweight", "Obese", "Severely Obese"),
                       right = FALSE)
# Factorization
obdata$BMIGroup <- as.factor(obdata$BMIGroup)
# Grouping of AGE by quartiles
obdata$AgeGroup <- cut(obdata$Age, 
                       breaks = c(0, 18, 25, 35, 50, Inf), 
                       labels = c("Teen", "Young Adult", "Adult", "Middle Age", "Senior"),
                       right = FALSE)

# Factorization
obdata$AgeGroup <- as.factor(obdata$AgeGroup)
obdata <- obdata[, !(names(obdata) %in% c("Age", "Height", "Weight","BMI"))]
summary(obdata)
##     Gender    family_history  FAVC              CAEC      SMOKE       SCC      
##  Female:682   no : 359       no : 241   Always    :  53   no :1552   no :1502  
##  Male  :910   yes:1233       yes:1351   Frequently: 241   yes:  40   yes:  90  
##                                         no        :  33                        
##                                         Sometimes :1265                        
##                                                                                
##                                                                                
##                                                                                
##          CALC                       MTRANS                    Obesity   
##  Always    :  1   Automobile           : 399   Insufficient_Weight:272  
##  Frequently: 66   Bike                 :   7   Normal_Weight      :287  
##  no        :605   Motorbike            :  11   Obesity_Type_I     :351  
##  Sometimes :920   Public_Transportation:1119   Obesity_Type_II    :199  
##                   Walking              :  56   Obesity_Type_III   : 12  
##                                                Overweight_Level_I :181  
##                                                Overweight_Level_II:290  
##            BMIGroup          AgeGroup  
##  Underweight   :287   Teen       :109  
##  Normal        :213   Young Adult:981  
##  Overweight    :361   Adult      :350  
##  Obese         :580   Middle Age :142  
##  Severely Obese:151   Senior     : 10  
##                                        
## 

Exploratory Data Analysis (EDA)

# Bar charts for BMI grouping
ggplot(obdata, aes(x = BMIGroup, fill = BMIGroup)) +
  geom_bar() +
  labs(x = "BMI Group", y = "Count", title = "Distribution of BMI Groups") +
  theme_minimal() +
  theme(legend.position = "none")

# Bar charts for age grouping
ggplot(obdata, aes(x = AgeGroup, fill = AgeGroup)) +
  geom_bar() +
  labs(x = "Age Group", y = "Count", title = "Distribution of Age Groups") +
  theme_minimal() +
  theme(legend.position = "none")

The Apriori algorithm

At the beginning, I will apply the apriori algorithm without making any detailed assumptions. I will use the default parameters sizes.

# Convert to transactions
transactions <- as(obdata, "transactions")
summary(transactions)
## transactions as itemMatrix in sparse format with
##  1592 rows (elements/itemsets/transactions) and
##  40 columns (items) and a density of 0.275 
## 
## most frequent items:
##           SMOKE=no             SCC=no           FAVC=yes     CAEC=Sometimes 
##               1552               1502               1351               1265 
## family_history=yes            (Other) 
##               1233              10609 
## 
## element (itemset/transaction) length distribution:
## sizes
##   11 
## 1592 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      11      11      11      11      11      11 
## 
## includes extended item information - examples:
##              labels      variables levels
## 1     Gender=Female         Gender Female
## 2       Gender=Male         Gender   Male
## 3 family_history=no family_history     no
## 
## includes extended transaction information - examples:
##   transactionID
## 1             1
## 2             2
## 3             3

The plot above shows us the frequency of occurance of each answer. The most frequent is favorite color - Cool (colors reported by respondents were mapped to either warm, cool or neutral).

# Discovering association rules using the apriori algorithm
rules <- apriori(
  data = transactions,
  parameter = list(
    supp = 0.18,    # Minimum support (5%)
    conf = 0.97,     # Minimum confidence (90%)
    minlen = 4      # Minimum rule length
  )
)
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.97    0.1    1 none FALSE            TRUE       5    0.18      4
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 286 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[40 item(s), 1592 transaction(s)] done [0.00s].
## sorting and recoding items ... [20 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 done [0.00s].
## writing ... [675 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
# Check the result of rule mining
rules
## set of 675 rules
summary(rules)
## set of 675 rules
## 
## rule length distribution (lhs + rhs):sizes
##   4   5   6   7   8 
## 231 257 148  37   2 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.000   4.000   5.000   4.996   6.000   8.000 
## 
## summary of quality measures:
##     support         confidence        coverage           lift       
##  Min.   :0.1803   Min.   :0.9700   Min.   :0.1803   Min.   :0.9952  
##  1st Qu.:0.2029   1st Qu.:0.9788   1st Qu.:0.2054   1st Qu.:1.0126  
##  Median :0.2155   Median :0.9858   Median :0.2186   Median :1.0381  
##  Mean   :0.2567   Mean   :0.9856   Mean   :0.2607   Mean   :1.1301  
##  3rd Qu.:0.2880   3rd Qu.:0.9915   3rd Qu.:0.2943   3rd Qu.:1.0568  
##  Max.   :0.6646   Max.   :1.0000   Max.   :0.6765   Max.   :2.7448  
##      count       
##  Min.   : 287.0  
##  1st Qu.: 323.0  
##  Median : 343.0  
##  Mean   : 408.7  
##  3rd Qu.: 458.5  
##  Max.   :1058.0  
## 
## mining info:
##          data ntransactions support confidence
##  transactions          1592    0.18       0.97
##                                                                                  call
##  apriori(data = transactions, parameter = list(supp = 0.18, conf = 0.97, minlen = 4))

The maximum lift (~2.74) is moderate—these rules do show noticeable relationships, but not extremely strong ones.

rules_filtered <- subset(rules, subset = lift > 1.5)

# Removing redundant rules
rules_non_redundant <- rules_filtered[!is.redundant(rules_filtered)]

# View the top 10 rules
inspect(head(rules_non_redundant, n = 10))
##      lhs                         rhs                support confidence  coverage     lift count
## [1]  {family_history=yes,                                                                      
##       CAEC=Sometimes,                                                                          
##       Obesity=Obesity_Type_I} => {BMIGroup=Obese} 0.2085427          1 0.2085427 2.744828   332
## [2]  {family_history=yes,                                                                      
##       FAVC=yes,                                                                                
##       Obesity=Obesity_Type_I} => {BMIGroup=Obese} 0.2097990          1 0.2097990 2.744828   334
## [3]  {family_history=yes,                                                                      
##       SCC=no,                                                                                  
##       Obesity=Obesity_Type_I} => {BMIGroup=Obese} 0.2154523          1 0.2154523 2.744828   343
## [4]  {family_history=yes,                                                                      
##       SMOKE=no,                                                                                
##       Obesity=Obesity_Type_I} => {BMIGroup=Obese} 0.2129397          1 0.2129397 2.744828   339
## [5]  {FAVC=yes,                                                                                
##       CAEC=Sometimes,                                                                          
##       Obesity=Obesity_Type_I} => {BMIGroup=Obese} 0.2054020          1 0.2054020 2.744828   327
## [6]  {CAEC=Sometimes,                                                                          
##       SCC=no,                                                                                  
##       Obesity=Obesity_Type_I} => {BMIGroup=Obese} 0.2110553          1 0.2110553 2.744828   336
## [7]  {CAEC=Sometimes,                                                                          
##       SMOKE=no,                                                                                
##       Obesity=Obesity_Type_I} => {BMIGroup=Obese} 0.2091709          1 0.2091709 2.744828   333
## [8]  {FAVC=yes,                                                                                
##       SCC=no,                                                                                  
##       Obesity=Obesity_Type_I} => {BMIGroup=Obese} 0.2129397          1 0.2129397 2.744828   339
## [9]  {FAVC=yes,                                                                                
##       SMOKE=no,                                                                                
##       Obesity=Obesity_Type_I} => {BMIGroup=Obese} 0.2110553          1 0.2110553 2.744828   336
## [10] {SMOKE=no,                                                                                
##       SCC=no,                                                                                  
##       Obesity=Obesity_Type_I} => {BMIGroup=Obese} 0.2154523          1 0.2154523 2.744828   343
# Subset: rules where the consequent (rhs) contains "Obesity_Type_III"
rules_default_yes <- subset(rules, subset = rhs %pin% "Obesity_Type_III")
inspect(rules_default_yes)

BMI and obesity: When an individual had a family history of obesity (family_history=yes), was a non-smoker (SMOKE=no), and belonged to Obesity Type I (Obesity=Obesity_Type_I), the probability that their BMI grouping was “Obese” (BMIGroup= Obese) has a very high probability. This rule has a support of 0.213, which means that about 21.3% of the samples in the dataset satisfy this condition, and a confidence level of 1, which means that of all the samples that satisfy the antecedent (left-hand side condition), 100% satisfy the consequent (right-hand side result). In addition, the rule has an uplift of 2.745, indicating a strong correlation, i.e., there is a significant positive correlation between family history of obesity, non-smoking, and obesity type I, and BMI grouping of “obese”. A total of 339 samples were covered by this rule. ## Visualization of Rules

# Basic scatter plot (support vs. confidence) with color shading for lift
plot(rules, measure="support", shading="lift")

# Grouped matrix visualization
plot(rules, method = "grouped",k= 18)

# Focus on top 20 rules by lift
sub_rules <- head(sort(rules, by="lift", decreasing=TRUE), 20)
plot(sub_rules, method = "graph", engine = "html")

Conclusion

A confidence level of 1 for all rules indicates that there is a 100% probability that an individual will be categorized as “obese” (BMIGroup=Obese) under these specific conditions (e.g., family history, dietary habits, smoking status, etc.). This indicates a strong association between these conditions and obesity.
The support for all rules ranged from 0.205 to 0.215, indicating that these rules occurred more frequently in the dataset and were more generalizable. The lift (lift) of all rules was 2.744828, indicating that the prevalence of obesity in these conditions was 2.74 times the baseline level, further validating the strong correlation between these conditions and obesity. Obesity=Obesity_Type_I (Obesity Type I) was a common key factor in all rules, indicating that Obesity Type I played a central role in these association rules. In addition, factors such as family_history=yes, dietary habits (e.g., FAVC=yes, CAEC=Sometimes), and non-smoking (SMOKE=no) were also frequent, suggesting that these factors have an important influence in the development of obesity.
The coverage (COVERAGE) of all the rules was consistent with the support, indicating that these rules have a wide coverage in the dataset, further supporting their reliability in practical applications. These association rules revealed strong associations between obesity and multiple factors (e.g., family history, dietary habits, smoking status, etc.), especially in the case of obesity type I. The results showed that obesity was associated with a wide range of factors (e.g., family history, dietary habits, smoking status, etc.). These findings can provide an important reference for obesity prevention and intervention.