Obesity is a complex condition influenced by various factors, including lifestyle, genetics, and behavior. One of the critical aspects is how these factors differ among individuals. Therefore, the aim of this work is to analyze the factors contributing to obesity using a data-driven approach. In this article, I will use association rules, an unsupervised learning technique that seeks to uncover and describe patterns and relationships between variables in the dataset.
family_history : Has a family member suffered or suffers from
overweight?
FAVC : Do you eat high caloric food frequently?
CAEC : Do you eat any food between meals?
SMOKE : Do you smoke?
SCC : Do you monitor the calories you eat daily?
CALC : How often do you drink alcohol?
MTRANS : Which transportation do you usually use?
Obesity_level (Target Column) : Obesity level
This dataset focuses on factors influencing obesity levels, including
variables such as age, gender, height, weight, and lifestyle habits like
diet, physical activity, and technology use. It aims to uncover patterns
and relationships using association rules to better understand the
contributors to obesity.
# Load necessary libraries
library(arules)
library(arulesViz)
library(dplyr)
library(ggplot2)
library(caret)
library(naniar)
# Load the dataset
obdata <- read.csv("C:/Users/Pandita/Desktop/Obesity.csv", sep = ",", dec = ".", header = TRUE)
str(obdata)
## 'data.frame': 1592 obs. of 12 variables:
## $ Gender : chr "Female" "Female" "Male" "Male" ...
## $ Age : num 21 21 23 27 22 29 23 22 24 22 ...
## $ Height : num 1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 ...
## $ Weight : num 64 56 77 87 89.8 53 55 53 64 68 ...
## $ family_history: chr "yes" "yes" "yes" "no" ...
## $ FAVC : chr "no" "no" "no" "no" ...
## $ CAEC : chr "Sometimes" "Sometimes" "Sometimes" "Sometimes" ...
## $ SMOKE : chr "no" "yes" "no" "no" ...
## $ SCC : chr "no" "yes" "no" "no" ...
## $ CALC : chr "no" "Sometimes" "Frequently" "Frequently" ...
## $ MTRANS : chr "Public_Transportation" "Public_Transportation" "Public_Transportation" "Walking" ...
## $ Obesity : chr "Normal_Weight" "Normal_Weight" "Normal_Weight" "Overweight_Level_I" ...
vis_miss(obdata)
## No missing values were found in the data
# Converting variables into integer
obdata$Age <- as.integer(obdata$Age)
# Converting categorical variables into factors
obdata$Gender <- as.factor(obdata$Gender)
obdata$family_history <- as.factor(obdata$family_history)
obdata$CAEC <- as.factor(obdata$CAEC)
obdata$SCC <- as.factor(obdata$SCC)
obdata$SMOKE <- as.factor(obdata$SMOKE)
obdata$CALC <- as.factor(obdata$CALC)
obdata$MTRANS <- as.factor(obdata$MTRANS)
obdata$FAVC <- as.factor(obdata$FAVC)
obdata$Obesity <- as.factor(obdata$Obesity)
In order to more fully assess an individual’s health, I calculated BMI (Body Mass Index) using height and weight.BMI is a commonly used metric for determining whether or not weight is within a healthy range. By calculating the BMI, I can better understand whether the individuals in my data are underweight, normal, overweight, or obese, which can support subsequent health analysis and decision-making.
# Grouping for age and BMI
obdata$BMI <- obdata$Weight / (obdata$Height^2)
obdata$BMI <- as.integer(obdata$BMI)
quantile(obdata$Age)
## 0% 25% 50% 75% 100%
## 14 19 22 26 61
quantile(obdata$BMI)
## 0% 25% 50% 75% 100%
## 12 21 27 32 49
# Grouping of BMI by quartiles
obdata$BMIGroup <- cut(obdata$BMI,
breaks = c(0, 18.5, 24, 28, 36, Inf),
labels = c("Underweight", "Normal", "Overweight", "Obese", "Severely Obese"),
right = FALSE)
# Factorization
obdata$BMIGroup <- as.factor(obdata$BMIGroup)
# Grouping of AGE by quartiles
obdata$AgeGroup <- cut(obdata$Age,
breaks = c(0, 18, 25, 35, 50, Inf),
labels = c("Teen", "Young Adult", "Adult", "Middle Age", "Senior"),
right = FALSE)
# Factorization
obdata$AgeGroup <- as.factor(obdata$AgeGroup)
obdata <- obdata[, !(names(obdata) %in% c("Age", "Height", "Weight","BMI"))]
summary(obdata)
## Gender family_history FAVC CAEC SMOKE SCC
## Female:682 no : 359 no : 241 Always : 53 no :1552 no :1502
## Male :910 yes:1233 yes:1351 Frequently: 241 yes: 40 yes: 90
## no : 33
## Sometimes :1265
##
##
##
## CALC MTRANS Obesity
## Always : 1 Automobile : 399 Insufficient_Weight:272
## Frequently: 66 Bike : 7 Normal_Weight :287
## no :605 Motorbike : 11 Obesity_Type_I :351
## Sometimes :920 Public_Transportation:1119 Obesity_Type_II :199
## Walking : 56 Obesity_Type_III : 12
## Overweight_Level_I :181
## Overweight_Level_II:290
## BMIGroup AgeGroup
## Underweight :287 Teen :109
## Normal :213 Young Adult:981
## Overweight :361 Adult :350
## Obese :580 Middle Age :142
## Severely Obese:151 Senior : 10
##
##
# Bar charts for BMI grouping
ggplot(obdata, aes(x = BMIGroup, fill = BMIGroup)) +
geom_bar() +
labs(x = "BMI Group", y = "Count", title = "Distribution of BMI Groups") +
theme_minimal() +
theme(legend.position = "none")
# Bar charts for age grouping
ggplot(obdata, aes(x = AgeGroup, fill = AgeGroup)) +
geom_bar() +
labs(x = "Age Group", y = "Count", title = "Distribution of Age Groups") +
theme_minimal() +
theme(legend.position = "none")
At the beginning, I will apply the apriori algorithm without making any detailed assumptions. I will use the default parameters sizes.
# Convert to transactions
transactions <- as(obdata, "transactions")
summary(transactions)
## transactions as itemMatrix in sparse format with
## 1592 rows (elements/itemsets/transactions) and
## 40 columns (items) and a density of 0.275
##
## most frequent items:
## SMOKE=no SCC=no FAVC=yes CAEC=Sometimes
## 1552 1502 1351 1265
## family_history=yes (Other)
## 1233 10609
##
## element (itemset/transaction) length distribution:
## sizes
## 11
## 1592
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11 11 11 11 11 11
##
## includes extended item information - examples:
## labels variables levels
## 1 Gender=Female Gender Female
## 2 Gender=Male Gender Male
## 3 family_history=no family_history no
##
## includes extended transaction information - examples:
## transactionID
## 1 1
## 2 2
## 3 3
The plot above shows us the frequency of occurance of each answer. The most frequent is favorite color - Cool (colors reported by respondents were mapped to either warm, cool or neutral).
# Discovering association rules using the apriori algorithm
rules <- apriori(
data = transactions,
parameter = list(
supp = 0.18, # Minimum support (5%)
conf = 0.97, # Minimum confidence (90%)
minlen = 4 # Minimum rule length
)
)
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.97 0.1 1 none FALSE TRUE 5 0.18 4
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 286
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[40 item(s), 1592 transaction(s)] done [0.00s].
## sorting and recoding items ... [20 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 done [0.00s].
## writing ... [675 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
# Check the result of rule mining
rules
## set of 675 rules
summary(rules)
## set of 675 rules
##
## rule length distribution (lhs + rhs):sizes
## 4 5 6 7 8
## 231 257 148 37 2
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.000 4.000 5.000 4.996 6.000 8.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.1803 Min. :0.9700 Min. :0.1803 Min. :0.9952
## 1st Qu.:0.2029 1st Qu.:0.9788 1st Qu.:0.2054 1st Qu.:1.0126
## Median :0.2155 Median :0.9858 Median :0.2186 Median :1.0381
## Mean :0.2567 Mean :0.9856 Mean :0.2607 Mean :1.1301
## 3rd Qu.:0.2880 3rd Qu.:0.9915 3rd Qu.:0.2943 3rd Qu.:1.0568
## Max. :0.6646 Max. :1.0000 Max. :0.6765 Max. :2.7448
## count
## Min. : 287.0
## 1st Qu.: 323.0
## Median : 343.0
## Mean : 408.7
## 3rd Qu.: 458.5
## Max. :1058.0
##
## mining info:
## data ntransactions support confidence
## transactions 1592 0.18 0.97
## call
## apriori(data = transactions, parameter = list(supp = 0.18, conf = 0.97, minlen = 4))
The maximum lift (~2.74) is moderate—these rules do show noticeable relationships, but not extremely strong ones.
rules_filtered <- subset(rules, subset = lift > 1.5)
# Removing redundant rules
rules_non_redundant <- rules_filtered[!is.redundant(rules_filtered)]
# View the top 10 rules
inspect(head(rules_non_redundant, n = 10))
## lhs rhs support confidence coverage lift count
## [1] {family_history=yes,
## CAEC=Sometimes,
## Obesity=Obesity_Type_I} => {BMIGroup=Obese} 0.2085427 1 0.2085427 2.744828 332
## [2] {family_history=yes,
## FAVC=yes,
## Obesity=Obesity_Type_I} => {BMIGroup=Obese} 0.2097990 1 0.2097990 2.744828 334
## [3] {family_history=yes,
## SCC=no,
## Obesity=Obesity_Type_I} => {BMIGroup=Obese} 0.2154523 1 0.2154523 2.744828 343
## [4] {family_history=yes,
## SMOKE=no,
## Obesity=Obesity_Type_I} => {BMIGroup=Obese} 0.2129397 1 0.2129397 2.744828 339
## [5] {FAVC=yes,
## CAEC=Sometimes,
## Obesity=Obesity_Type_I} => {BMIGroup=Obese} 0.2054020 1 0.2054020 2.744828 327
## [6] {CAEC=Sometimes,
## SCC=no,
## Obesity=Obesity_Type_I} => {BMIGroup=Obese} 0.2110553 1 0.2110553 2.744828 336
## [7] {CAEC=Sometimes,
## SMOKE=no,
## Obesity=Obesity_Type_I} => {BMIGroup=Obese} 0.2091709 1 0.2091709 2.744828 333
## [8] {FAVC=yes,
## SCC=no,
## Obesity=Obesity_Type_I} => {BMIGroup=Obese} 0.2129397 1 0.2129397 2.744828 339
## [9] {FAVC=yes,
## SMOKE=no,
## Obesity=Obesity_Type_I} => {BMIGroup=Obese} 0.2110553 1 0.2110553 2.744828 336
## [10] {SMOKE=no,
## SCC=no,
## Obesity=Obesity_Type_I} => {BMIGroup=Obese} 0.2154523 1 0.2154523 2.744828 343
# Subset: rules where the consequent (rhs) contains "Obesity_Type_III"
rules_default_yes <- subset(rules, subset = rhs %pin% "Obesity_Type_III")
inspect(rules_default_yes)
BMI and obesity: When an individual had a family history of obesity (family_history=yes), was a non-smoker (SMOKE=no), and belonged to Obesity Type I (Obesity=Obesity_Type_I), the probability that their BMI grouping was “Obese” (BMIGroup= Obese) has a very high probability. This rule has a support of 0.213, which means that about 21.3% of the samples in the dataset satisfy this condition, and a confidence level of 1, which means that of all the samples that satisfy the antecedent (left-hand side condition), 100% satisfy the consequent (right-hand side result). In addition, the rule has an uplift of 2.745, indicating a strong correlation, i.e., there is a significant positive correlation between family history of obesity, non-smoking, and obesity type I, and BMI grouping of “obese”. A total of 339 samples were covered by this rule. ## Visualization of Rules
# Basic scatter plot (support vs. confidence) with color shading for lift
plot(rules, measure="support", shading="lift")
# Grouped matrix visualization
plot(rules, method = "grouped",k= 18)
# Focus on top 20 rules by lift
sub_rules <- head(sort(rules, by="lift", decreasing=TRUE), 20)
plot(sub_rules, method = "graph", engine = "html")
A confidence level of 1 for all rules indicates that there is a 100%
probability that an individual will be categorized as “obese”
(BMIGroup=Obese) under these specific conditions (e.g., family history,
dietary habits, smoking status, etc.). This indicates a strong
association between these conditions and obesity.
The support for all rules ranged from 0.205 to 0.215, indicating that
these rules occurred more frequently in the dataset and were more
generalizable. The lift (lift) of all rules was 2.744828, indicating
that the prevalence of obesity in these conditions was 2.74 times the
baseline level, further validating the strong correlation between these
conditions and obesity. Obesity=Obesity_Type_I (Obesity Type I) was a
common key factor in all rules, indicating that Obesity Type I played a
central role in these association rules. In addition, factors such as
family_history=yes, dietary habits (e.g., FAVC=yes, CAEC=Sometimes), and
non-smoking (SMOKE=no) were also frequent, suggesting that these factors
have an important influence in the development of obesity.
The coverage (COVERAGE) of all the rules was consistent with the
support, indicating that these rules have a wide coverage in the
dataset, further supporting their reliability in practical applications.
These association rules revealed strong associations between obesity and
multiple factors (e.g., family history, dietary habits, smoking status,
etc.), especially in the case of obesity type I. The results showed that
obesity was associated with a wide range of factors (e.g., family
history, dietary habits, smoking status, etc.). These findings can
provide an important reference for obesity prevention and
intervention.