Obesity is a complex global health challenge with roots extending beyond individual lifestyle choices. Despite the modern trend regarding health consciousness which encourage people to embrace healthier habits, including better diet and doing more exercise, obesity remains a prevalent issue.
Based on the dataset that estimate obesity levels in individual from Mexico, Peru and Colombia in 2019, this project aims to uncover hidden patterns causing obesity in these three nations by using association rule mining.
For more information regarding this dataset, 77% of the data was generated synthetically using the Weka tool and the SMOTE filter, and 23% of the data was collected directly from users through a web platform by 5 researchers in Colombia (De-La-Hoz-Correa et al., 2019).
In this dataset, there is 1 target which is the level of obesity and
16 features including gender, age, height, weight, family history with
over weight, and these questions:
Do you eat high caloric food
frequently?
Do you usually eat vegetables in your meals?
How
many main meals do you have daily?
Do you eat any food between
meals?
Do you smoke?
How much water do you drink daily?
Do you monitor the calories you eat daily?
How often do you have
physical activity (per week)?
How much time do you use
technological devices such as cell phone, video games, television,
computer and others daily?
How often do you drink alcohol?
Which transportation do you usually use?
During the first step of data processing, I will explore the data, checking the structure and remove duplicate observations.
df <- read.csv("obesity.csv", sep = ",")
str(df)
## 'data.frame': 2111 obs. of 17 variables:
## $ Gender : chr "Female" "Female" "Male" "Male" ...
## $ Age : num 21 21 23 27 22 29 23 22 24 22 ...
## $ Height : num 1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 ...
## $ Weight : num 64 56 77 87 89.8 53 55 53 64 68 ...
## $ family_history_with_overweight: chr "yes" "yes" "yes" "no" ...
## $ high.caloric.food.frequently : chr "no" "no" "no" "no" ...
## $ vegetables.in.meals : num 2 3 2 3 2 2 3 2 3 2 ...
## $ number.of.main.meals : num 3 3 3 3 1 3 3 3 3 3 ...
## $ food.between.meals : chr "Sometimes" "Sometimes" "Sometimes" "Sometimes" ...
## $ smoke : chr "no" "yes" "no" "no" ...
## $ water.daily : num 2 3 2 2 2 2 2 2 2 2 ...
## $ monitor.calories : chr "no" "yes" "no" "no" ...
## $ physical.activity : num 0 3 2 2 0 0 1 3 1 1 ...
## $ tech.devices.usage : num 1 0 1 0 0 0 0 0 1 1 ...
## $ alcohol.frequency : chr "no" "Sometimes" "Frequently" "Frequently" ...
## $ transportation.type : chr "Public_Transportation" "Public_Transportation" "Public_Transportation" "Walking" ...
## $ NObeyesdad : chr "Normal_Weight" "Normal_Weight" "Normal_Weight" "Overweight_Level_I" ...
head(df)
## Gender Age Height Weight family_history_with_overweight
## 1 Female 21 1.62 64.0 yes
## 2 Female 21 1.52 56.0 yes
## 3 Male 23 1.80 77.0 yes
## 4 Male 27 1.80 87.0 no
## 5 Male 22 1.78 89.8 no
## 6 Male 29 1.62 53.0 no
## high.caloric.food.frequently vegetables.in.meals number.of.main.meals
## 1 no 2 3
## 2 no 3 3
## 3 no 2 3
## 4 no 3 3
## 5 no 2 1
## 6 yes 2 3
## food.between.meals smoke water.daily monitor.calories physical.activity
## 1 Sometimes no 2 no 0
## 2 Sometimes yes 3 yes 3
## 3 Sometimes no 2 no 2
## 4 Sometimes no 2 no 2
## 5 Sometimes no 2 no 0
## 6 Sometimes no 2 no 0
## tech.devices.usage alcohol.frequency transportation.type
## 1 1 no Public_Transportation
## 2 0 Sometimes Public_Transportation
## 3 1 Frequently Public_Transportation
## 4 0 Frequently Walking
## 5 0 Sometimes Public_Transportation
## 6 0 Sometimes Automobile
## NObeyesdad
## 1 Normal_Weight
## 2 Normal_Weight
## 3 Normal_Weight
## 4 Overweight_Level_I
## 5 Overweight_Level_II
## 6 Normal_Weight
df <- df[!duplicated(df), ]
table(df$Gender)
##
## Female Male
## 1035 1052
I aim to ensure an equal distribution of genders in the dataset to maintain the reliability of the results. With 1,035 females and 1,052 males, the distribution is balanced enough to proceed with the analysis.
Additionally, in order to apply the association rules to find pattern in obesity, I first need to transform the data from raw dataset. Character data will be recognize as factor, and each numerical data will be divide into meaningful categories.
Physical attributes such as height and weight play a crucial role in determining obesity levels, with the criteria for assessing obesity often varying by age and gender. To create a more intuitive and actionable feature for further research, I will calculate the Body Mass Index (BMI) for each individual.
BMI is a standardized metric widely recommended in population health studies and clinical settings. It provides a simple yet effective measure for evaluating an individual’s weight status relative to their height.
The formula for BMI is as follows: \[ \text{BMI} = \frac{\text{Weight (kg)}}{\text{Height (m)}^2} \] This calculation will support to classify individuals into meaningful categories and support a deeper analysis of obesity-related patterns.
df$BMI <- df$Weight / (df$Height^2)
In the next step, I will divide each numerical variables to 4 different categories based on the distribution of data, and recommendation of BMI categories from the U.S. Centers for Disease Control.
df$bmi_grp <- ifelse(df[, 18] < 18.5, "low BMI", ifelse(df[, 18] < 25, "normal BMI", ifelse(df[, 18] < 30, "high BMI", "very high BMI")))
df$age_grp <- ifelse(df[, 2] < 20, "under 20", ifelse(df[, 2] < 23, "under 23", ifelse(df[, 2] < 27, "23 to 27", "older than 27")))
df$main_meals <- ifelse(df[, 8] < 2, "under 2 meals", ifelse(df[, 8] < 3, "2 to 3 meals", ifelse(df[, 8] == 3, "3 meals per day", "more than 3 meals")))
df$h20_group <- ifelse(df[, 11] < 1.5, "less than 1.5 litres", ifelse(df[,11] < 2, "1.5 - 2 litres", ifelse(df[,11] < 2.5, "2 - 2.5 litres", "more than 2.5 litres")))
df$activity_level <- ifelse(df[, 13] == 0, "do not excersize", ifelse(df[, 13] < 1, "minimal excersize", ifelse(df[, 13] < 2, "moderate excersize", "active")))
df$tech_use <- ifelse(df[, 14] < 0.5, "barely use tech device", ifelse(df[, 14] < 1, "less than an hour", ifelse(df[, 14] < 2, "1 to 2 hours", "more than 2 hours")))
During the survey, there are multiple binary questions, which would cause duplicate values when applying the association rules, therefore, those answers will be also rename.
Within the dataset, the authors classified 2 types of overweight and 3 types of obesity. However, during this project, I aimed to find the factors that lead to obesity and overweight in general, therefore, I will group all obesity types into 1 and overweight into 1 category.
df <- df %>% mutate(
weight_type = case_when(
NObeyesdad %in% c("Normal_Weight", "Insufficient_Weight") ~ NObeyesdad,
NObeyesdad %in% c("Obesity_Type_I", "Obesity_Type_II", "Obesity_Type_III") ~ "Obesity",
NObeyesdad %in% c("Overweight_Level_I", "Overweight_Level_II") ~ "Overweight",
TRUE ~ NObeyesdad))
I will then limit the data by choosing appropriate columns for further analysis, get rid of the variable regarding number of vegetables in meals. Checking the dataset before apply association rule.
transdf <- df[, c(1, 16, 19:31)]
head(transdf)
## Gender transportation.type bmi_grp age_grp main_meals
## 1 Female Public_Transportation normal BMI under 23 3 meals per day
## 2 Female Public_Transportation normal BMI under 23 3 meals per day
## 3 Male Public_Transportation normal BMI 23 to 27 3 meals per day
## 4 Male Walking high BMI older than 27 3 meals per day
## 5 Male Public_Transportation high BMI under 23 under 2 meals
## 6 Male Automobile normal BMI older than 27 3 meals per day
## h20_group activity_level tech_use fam_hist
## 1 2 - 2.5 litres do not excersize 1 to 2 hours has history
## 2 more than 2.5 litres active barely use tech device has history
## 3 2 - 2.5 litres active 1 to 2 hours has history
## 4 2 - 2.5 litres active barely use tech device no history
## 5 2 - 2.5 litres do not excersize barely use tech device no history
## 6 2 - 2.5 litres do not excersize barely use tech device no history
## high_cal is_smoke cal_monitor alc_freq
## 1 no high kcal food not smoke no monitor kcal do not drink alc
## 2 no high kcal food smoke monitor kcal sometimes drink alc
## 3 no high kcal food not smoke no monitor kcal drink alc frequently
## 4 no high kcal food not smoke no monitor kcal drink alc frequently
## 5 no high kcal food not smoke no monitor kcal sometimes drink alc
## 6 eat high kcal food not smoke no monitor kcal sometimes drink alc
## food_btmeal weight_type
## 1 sometimes eat between meals Normal_Weight
## 2 sometimes eat between meals Normal_Weight
## 3 sometimes eat between meals Normal_Weight
## 4 sometimes eat between meals Overweight
## 5 sometimes eat between meals Overweight
## 6 sometimes eat between meals Normal_Weight
write.csv(transdf, file="obese_analysis.csv", row.names = FALSE)
csv file will be imported by using the read.transactions() function to create a proper format for further analysis
transactions <- read.transactions("obese_analysis.csv", format="basket", sep=",", skip=1)
inspect(transactions[1:5])
## items
## [1] {1 to 2 hours,
## 2 - 2.5 litres,
## 3 meals per day,
## do not drink alc,
## do not excersize,
## Female,
## has history,
## no high kcal food,
## no monitor kcal,
## normal BMI,
## Normal_Weight,
## not smoke,
## Public_Transportation,
## sometimes eat between meals,
## under 23}
## [2] {3 meals per day,
## active,
## barely use tech device,
## Female,
## has history,
## monitor kcal,
## more than 2.5 litres,
## no high kcal food,
## normal BMI,
## Normal_Weight,
## Public_Transportation,
## smoke,
## sometimes drink alc,
## sometimes eat between meals,
## under 23}
## [3] {1 to 2 hours,
## 2 - 2.5 litres,
## 23 to 27,
## 3 meals per day,
## active,
## drink alc frequently,
## has history,
## Male,
## no high kcal food,
## no monitor kcal,
## normal BMI,
## Normal_Weight,
## not smoke,
## Public_Transportation,
## sometimes eat between meals}
## [4] {2 - 2.5 litres,
## 3 meals per day,
## active,
## barely use tech device,
## drink alc frequently,
## high BMI,
## Male,
## no high kcal food,
## no history,
## no monitor kcal,
## not smoke,
## older than 27,
## Overweight,
## sometimes eat between meals,
## Walking}
## [5] {2 - 2.5 litres,
## barely use tech device,
## do not excersize,
## high BMI,
## Male,
## no high kcal food,
## no history,
## no monitor kcal,
## not smoke,
## Overweight,
## Public_Transportation,
## sometimes drink alc,
## sometimes eat between meals,
## under 2 meals,
## under 23}
length(transactions)
## [1] 2087
How is all the item varies within this dataset, checking the item frequency to recognize the items that appear the most.
sorted_freq <- sort(itemFrequency(transactions, type="relative"))
sorted_freq
## always drink alc Bike
## 0.0004791567 0.0033540968
## Motorbike no food between meals
## 0.0052707235 0.0177287973
## smoke always eat between meals
## 0.0210828941 0.0253953043
## Walking drink alc frequently
## 0.0263536176 0.0335409679
## monitor kcal more than 2 hours
## 0.0459990417 0.0517489219
## more than 3 meals frequently eat between meals
## 0.1092477240 0.1130809775
## no high kcal food low BMI
## 0.1164350743 0.1274556780
## Insufficient_Weight Normal_Weight
## 0.1279348347 0.1351221850
## 1.5 - 2 litres 2 to 3 meals
## 0.1360804983 0.1365596550
## normal BMI no history
## 0.1413512218 0.1748921897
## active under 2 meals
## 0.1796837566 0.1796837566
## do not excersize Automobile
## 0.1940584571 0.2184954480
## older than 27 less than an hour
## 0.2204120747 0.2218495448
## less than 1.5 litres more than 2.5 litres
## 0.2290368951 0.2405366555
## under 20 23 to 27
## 0.2558696694 0.2606612362
## under 23 high BMI
## 0.2630570196 0.2644944897
## Overweight 1 to 2 hours
## 0.2712026833 0.2798275036
## minimal excersize do not drink alc
## 0.2874940105 0.3047436512
## moderate excersize 2 - 2.5 litres
## 0.3387637758 0.3943459511
## barely use tech device Obesity
## 0.4465740297 0.4657402971
## very high BMI Female
## 0.4666986104 0.4959271682
## Male 3 meals per day
## 0.5040728318 0.5745088644
## sometimes drink alc Public_Transportation
## 0.6612362242 0.7465261140
## has history sometimes eat between meals
## 0.8251078103 0.8437949209
## eat high kcal food no monitor kcal
## 0.8835649257 0.9540009583
## not smoke
## 0.9789171059
par(mar = c(10, 4, 4, 2))
barplot(sorted_freq, main = "Item Frequency", col = "skyblue", las = 2)
In general, there are 12.79% people that has insufficient weight, 13.5% is in normal weight range. Clearly, the number of people who are expose to overweight and obesity is significantly higher than those with normal and low weight since 27% are overweight and 46.57% has obesity.
Now, I would like to check the association rules in of each type of weight, starting with a reasonable level of support, then I will subset the rules to find the factors of obesity.
rules <- apriori(transactions, parameter = list(supp = 0.1, conf = 0.8, maxlen = 5),
appearance = list(default = "lhs", rhs = c("Obesity", "Overweight", "Normal_Weight", "Insufficient_Weight")), control=list(verbose=F))
summary(rules)
## set of 760 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5
## 4 40 190 526
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 4.000 5.000 4.629 5.000 5.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.1001 Min. :0.8007 Min. :0.1001 Min. :1.719
## 1st Qu.:0.1131 1st Qu.:0.9929 1st Qu.:0.1188 1st Qu.:2.135
## Median :0.1385 Median :0.9970 Median :0.1457 Median :2.147
## Mean :0.1673 Mean :0.9728 Mean :0.1714 Mean :2.394
## 3rd Qu.:0.1774 3rd Qu.:1.0000 3rd Qu.:0.1827 3rd Qu.:2.147
## Max. :0.4648 Max. :1.0000 Max. :0.4667 Max. :7.744
## count
## Min. :209.0
## 1st Qu.:236.0
## Median :289.0
## Mean :349.2
## 3rd Qu.:370.2
## Max. :970.0
##
## mining info:
## data ntransactions support confidence
## transactions 2087 0.1 0.8
## call
## apriori(data = transactions, parameter = list(supp = 0.1, conf = 0.8, maxlen = 5), appearance = list(default = "lhs", rhs = c("Obesity", "Overweight", "Normal_Weight", "Insufficient_Weight")), control = list(verbose = F))
rules_sorted <- sort(rules, by = "lift", decreasing = TRUE)
inspect(head(rules_sorted))
## lhs rhs support confidence coverage lift count
## [1] {eat high kcal food,
## low BMI} => {Insufficient_Weight} 0.1020604 0.9906977 0.1030187 7.743768 213
## [2] {eat high kcal food,
## low BMI,
## not smoke} => {Insufficient_Weight} 0.1020604 0.9906977 0.1030187 7.743768 213
## [3] {low BMI} => {Insufficient_Weight} 0.1260182 0.9887218 0.1274557 7.728324 263
## [4] {low BMI,
## not smoke} => {Insufficient_Weight} 0.1255391 0.9886792 0.1269765 7.727991 262
## [5] {low BMI,
## no monitor kcal} => {Insufficient_Weight} 0.1154768 0.9877049 0.1169142 7.720375 241
## [6] {low BMI,
## no monitor kcal,
## not smoke} => {Insufficient_Weight} 0.1149976 0.9876543 0.1164351 7.719980 240
plot(rules, method = "grouped")
From the summary, there are 760 rules were mined by using Apriori algorithm, most of the rules range from 2 to 5 items. The support range from 1% to 46.5%, with the confidence is high, coming from 0.8 to 1, suggest great predictability of the right hand side from left hand side.
However, as there are still many rules which fall into the low support range, meaning they occur only in a few transactions. Hence, I will clean the rules by remove the redundant rules.
strong_rules <- rules[!is.redundant(rules)]
strong_rules <- strong_rules[(is.maximal(strong_rules))]
summary(strong_rules)
## set of 120 rules
##
## rule length distribution (lhs + rhs):sizes
## 3 4 5
## 13 19 88
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 4.000 5.000 4.625 5.000 5.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.1001 Min. :0.8007 Min. :0.1001 Min. :1.719
## 1st Qu.:0.1048 1st Qu.:0.8232 1st Qu.:0.1198 1st Qu.:1.768
## Median :0.1248 Median :0.8684 Median :0.1354 Median :1.864
## Mean :0.1326 Mean :0.8930 Mean :0.1485 Mean :2.070
## 3rd Qu.:0.1443 3rd Qu.:0.9920 3rd Qu.:0.1672 3rd Qu.:2.147
## Max. :0.3598 Max. :1.0000 Max. :0.3598 Max. :7.744
## count
## Min. :209.0
## 1st Qu.:218.8
## Median :260.5
## Mean :276.6
## 3rd Qu.:301.2
## Max. :751.0
##
## mining info:
## data ntransactions support confidence
## transactions 2087 0.1 0.8
## call
## apriori(data = transactions, parameter = list(supp = 0.1, conf = 0.8, maxlen = 5), appearance = list(default = "lhs", rhs = c("Obesity", "Overweight", "Normal_Weight", "Insufficient_Weight")), control = list(verbose = F))
par(mar = c(10, 4, 4, 2) + 0.1)
plot(strong_rules, method = "grouped")
After cleaning, I now have 120 rules in total. Based on the group matrix plot, the larger the bubble, the higher support it has. Along with that, the color of bubble indicates the lift of the rules. A stronger lift means a stronger association between LHS and RHS.
I would like to see the association rules for people with overweight.
overweight_rules <- subset(strong_rules, rhs %in% "Overweight")
inspectDT(overweight_rules)
plot(overweight_rules, method = "graph", control = list(type = "items"))
## Available control parameters (with default values):
## layout = stress
## circular = FALSE
## ggraphdots = NULL
## edges = <environment>
## nodes = <environment>
## nodetext = <environment>
## colors = c("#EE0000FF", "#EEEEEEFF")
## engine = ggplot2
## max = 100
## verbose = FALSE
For better illustrate the connection of the left-hand side and right-hand side, I will plot Sankey diagram, with the width of arrow corresponds to the support level of each rule, showing how the LHS contribute to the outcome.
library(networkD3)
#convert rules into a data frame and extract the lhs of the rules
rule_data <- as(overweight_rules, "data.frame")
rule_data$lhs <- gsub("\\{|\\}", "", rule_data$rules)
rule_data$lhs <- sapply(strsplit(rule_data$lhs, " => "), "[", 1)
rule_data$rhs <- "Overweight"
nodes <- data.frame(name = unique(c(rule_data$lhs, rule_data$rhs)))
links <- data.frame(source = match(rule_data$lhs, nodes$name) - 1,
target = match(rule_data$rhs, nodes$name) - 1, value = rule_data$support)
sankeyNetwork(Links = links, Nodes = nodes, Source = "source",
Target = "target",Value = "value", NodeID = "name",
units = "Support",fontSize = 12,nodeWidth = 30)
As I can see, high BMI is central and strongly connected to overweight, and it can be consider as one of the most significant contributors to define people in overweight category. Additionally, the dark red color from the graph indicates a high lift, signifying that individuals with high BMI are much more likely to be classified as overweight compared to the baseline probability.
On the other hand, males are more likely to fall into the overweight category than females, and people who are overweight tend to eat 3 meals per day.
Other factor to consider is alcohol consumption, even though it has a lower support, indicates a lesser influence overall. However, it still plays a role as part of the rules combination contributing to being overweight.
The association rules for individuals with obesity will be checked.
obese_rules <- subset(strong_rules, rhs %in% "Obesity")
inspectDT(obese_rules)
There are total of 113 rules for obesity, to better look into the relation of it, I will visualize them on the plots.
plot(obese_rules, method = "graph", control = list(type = "items"), main = "Obesity rules ")
## Warning: Unknown control parameters: type, main
## Available control parameters (with default values):
## layout = stress
## circular = FALSE
## ggraphdots = NULL
## edges = <environment>
## nodes = <environment>
## nodetext = <environment>
## colors = c("#EE0000FF", "#EEEEEEFF")
## engine = ggplot2
## max = 100
## verbose = FALSE
## Warning: Too many rules supplied. Only plotting the best 100 using 'lift'
## (change control parameter max if needed).
plot(obese_rules, method = "paracoord", control = list(type = "items"), main = "Parallel coordinates plot for obesity rules")
## Warning: Unknown control parameters: type
## Available control parameters (with default values):
## main = Parallel coordinates plot for 113 rules
## reorder = TRUE
## interactive = FALSE
## engine = default
## gp_labels = list()
## newpage = TRUE
## col = c("#EE0000FF", "#EE0303FF", "#EE0606FF", "#EE0909FF", "#EE0C0CFF", "#EE0F0FFF", "#EE1212FF", "#EE1515FF", "#EE1818FF", "#EE1B1BFF", "#EE1E1EFF", "#EE2222FF", "#EE2525FF", "#EE2828FF", "#EE2B2BFF", "#EE2E2EFF", "#EE3131FF", "#EE3434FF", "#EE3737FF", "#EE3A3AFF", "#EE3D3DFF", "#EE4040FF", "#EE4444FF", "#EE4747FF", "#EE4A4AFF", "#EE4D4DFF", "#EE5050FF", "#EE5353FF", "#EE5656FF", "#EE5959FF", "#EE5C5CFF", "#EE5F5FFF", "#EE6262FF", "#EE6666FF", "#EE6969FF", "#EE6C6CFF", "#EE6F6FFF", "#EE7272FF", "#EE7575FF", "#EE7878FF", "#EE7B7BFF", "#EE7E7EFF", "#EE8181FF", "#EE8484FF", "#EE8888FF", "#EE8B8BFF", "#EE8E8EFF", "#EE9191FF", "#EE9494FF", "#EE9797FF", "#EE9999FF", "#EE9B9BFF", "#EE9D9DFF", "#EE9F9FFF", "#EEA0A0FF", "#EEA2A2FF", "#EEA4A4FF", "#EEA5A5FF", "#EEA7A7FF", "#EEA9A9FF", "#EEABABFF", "#EEACACFF", "#EEAEAEFF", "#EEB0B0FF", "#EEB1B1FF", "#EEB3B3FF", "#EEB5B5FF", "#EEB7B7FF", "#EEB8B8FF", "#EEBABAFF", "#EEBCBCFF", "#EEBDBDFF", "#EEBFBFFF", "#EEC1C1FF", "#EEC3C3FF", "#EEC4C4FF", "#EEC6C6FF", "#EEC8C8FF", "#EEC9C9FF", "#EECBCBFF", "#EECDCDFF", "#EECFCFFF", "#EED0D0FF", "#EED2D2FF", "#EED4D4FF", "#EED5D5FF", "#EED7D7FF", "#EED9D9FF", "#EEDBDBFF", "#EEDCDCFF", "#EEDEDEFF", "#EEE0E0FF", "#EEE1E1FF", "#EEE3E3FF", "#EEE5E5FF", "#EEE7E7FF", "#EEE8E8FF", "#EEEAEAFF", "#EEECECFF", "#EEEEEEFF")
## alpha = NULL
## quality = 2
## verbose = FALSE
As illustrated from the plots and with the level of confidence equal 100%, the prominent factor connect to obesity is very high BMI value, which in fact, BMI is a valid indicator to define obesity.
The other lifestyle and demographic factors that interact with “Very High BMI” are dietary habits, activity levels. With the second highest support of 34%, people who consume high-calorie foods, and sometimes drink alcohol strongly associate with obesity. Besides that, people that use public transportation, eat high calories food, along with very high BMI also appear to have obesity (support: 36% and lift: 2.147)
Individuals aged above 23 frequently appear in the rules, suggests a shift in weight-related behaviors or metabolism in these age groups.
For more visualization of the result, I have also plot the word cloud showing association rules of obesity.
#extract lhs items as string
lhs <- gsub("[{}]", "", labels(lhs(obese_rules)))
#split them into individual items for plotting
split_lhs<- unlist(strsplit(lhs, ","))
lhsfreq <- table(split_lhs)
set.seed(42)
wordcloud(words = names(lhsfreq), freq = as.numeric(lhsfreq),
min.freq = 2, max.words = 100, random.order = FALSE, colors = brewer.pal(8, "Dark2"), scale = c(2, 0.5))
Also, it is worth mentioning that individuals with a family history of obesity are more likely to develop obesity themselves. From the word cloud plot, I can notice that female appears more in the rules associated with obesity, which contrasts with the rules for overweight, where males are more prominent.
Interestingly, the use of technological devices shows minimal interaction with obesity, which is surprising for me because I thought individuals with obesity would use technical devices such as TV, or mobile phones frequently.
To manage physical health effectively, monitoring calories intake is a crucial factor. However, this aspect is often overlooked among individuals in the overweight and obesity groups.
Overweight is a precursor into obesity, hence, when people are getting signal of this, they should pay attention to adjust their lifestyle properly. Individuals should get encourage to frequently monitor their weight and BMI to identify early signs of overweight and obesity.
For businesses, publishing the information regarding calories per dish could better improve the awareness of calories intake, which is crucial for those with the risk of obesity.
For the government, it will be great to implement health promotion programs that provide incentives for healthy behaviors by collaborate with schools, workplaces, and communities, encourage people to stay active, consume less alcohol, and potentially, provide nutritional counseling and fitness programs aimed at breaking the cycle of obesity within families.
The Data set came from UC Irvine Machine Learning Repository, which can be found here Eduardo De-La-Hoz-Correa, Fabio E. Mendoza-Palechor, Alexis De-La-Hoz-Manotas, Roberto C. Morales-Ortegaand Sánchez Hernández Beatriz Adriana (2019) Obesity Level Estimation Software based on Decision Trees
Body Mass Index categories from the U.S. Centers for Disease Control and Prevention, which can be found here
Marc-André Cornier, MD (2022) A Review of Current Guidelines for the Treatment of Obesity