Association rule mining is a widely used data mining technique for discovering hidden patterns and relationships among variables in large datasets. In the context of public health, understanding how multiple risk factors co-occur across countries can provide valuable insights for policy design and preventive strategies.
This study applies three association rule mining approaches—Apriori, FP-Growth–style pattern mining, and ECLAT—to identify co-occurring health risk factors across countries using data from the World Health Organization (WHO).
The dataset was obtained from the WHO Global Health Observatory and includes country-level indicators on:
Body Mass Index (BMI)
Cholesterol levels
Depression prevalence
Alcohol consumption
To ensure temporal consistency, all indicators were restricted to observations from the year 2015. Each country was treated as one transaction.
library(dplyr)
library(stringr)
library(arules)
library(ggplot2)
library(tidyr)
bmi <- read.csv("data/BMI.csv", stringsAsFactors = FALSE)
chol <- read.csv("data/cholestrol.csv", stringsAsFactors = FALSE)
alc <- read.csv("data/alcohol.csv", stringsAsFactors = FALSE)
dep <- read.csv("data/depression.csv", stringsAsFactors = FALSE)
For consistency and comparability across datasets, the analysis was restricted to a single reference year. The year 2015 was selected because it was the most recent year for which data were available for all four indicators: depression, alcohol consumption, cholesterol, and BMI. Restricting the data to a common year ensures that all variables correspond to the same time period and avoids biases that could result from temporal mismatches between datasets.
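The choice of reference year described above can be sketched in a few lines of base R. The year vectors below are made-up placeholders, not the actual WHO coverage; with the real files one would presumably build them from each dataset's Period column (for example `unique(bmi$Period)`).

```r
# Hypothetical year coverage per indicator (illustrative values only)
years_by_indicator <- list(
  bmi  = c(2010, 2014, 2015, 2016),
  chol = c(2012, 2015, 2016, 2018),
  dep  = c(2015),
  alc  = c(2010, 2015, 2016)
)

# Years present in every indicator, then the most recent of them
common_years   <- Reduce(intersect, years_by_indicator)
reference_year <- max(common_years)
reference_year
```

If no year were shared by all four indicators, `common_years` would be empty and `max()` would warn, which is a useful signal that a single reference year is not available.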
library(dplyr)
bmi_2015 <- bmi %>%
filter(Period == 2015)
chol_2015 <- chol %>%
filter(Period == 2015)
dep_2015 <- dep %>%
filter(Period == 2015)
alc_2015 <- alc %>%
filter(Period == 2015)
nrow(bmi_2015)
## [1] 597
nrow(chol_2015)
## [1] 573
nrow(dep_2015)
## [1] 183
nrow(alc_2015)
## [1] 188
After filtering the datasets to the common reference year, an initial inspection of the BMI dataset was performed to verify that the relevant variables and values were correctly retained. Only the essential columns required for the analysis were then selected, namely the Location identifier and the corresponding numerical value for each health indicator. The generic column name Value was subsequently renamed to a meaningful and descriptive variable name in each dataset (BMI, Cholesterol, Depression, and Alcohol). This standardization ensured a consistent and interpretable structure across all four datasets and prepared them for a clean and reliable merge.
head(bmi_2015[, c("Location", "Period", "Value")])
## Location Period Value
## 1 Burundi 2015 10.2 [8.5-12.1]
## 2 Timor-Leste 2015 10.3 [9.2-11.4]
## 3 Uganda 2015 10.6 [9.5-11.7]
## 4 Ethiopia 2015 11.2 [9.9-12.5]
## 5 Eritrea 2015 11.4 [8.6-14.5]
## 6 Viet Nam 2015 11.7 [10.5-13.1]
bmi_2015 <- bmi_2015 %>%
select(Location, Value) %>%
rename(BMI = Value)
chol_2015 <- chol_2015 %>%
select(Location, Value) %>%
rename(Cholesterol = Value)
dep_2015 <- dep_2015 %>%
select(Location, Value) %>%
rename(Depression = Value)
alc_2015 <- alc_2015 %>%
select(Location, Value) %>%
rename(Alcohol = Value)
health_2015 <- bmi_2015 %>%
inner_join(chol_2015, by = "Location") %>%
inner_join(dep_2015, by = "Location") %>%
inner_join(alc_2015, by = "Location")
The four health indicator datasets (BMI, cholesterol, depression, and alcohol consumption) were merged into a single dataset using Location as the common key. Inner joins were applied to retain only those countries that were present in all four datasets, ensuring complete data availability across all indicators. The BMI and cholesterol files hold several rows per country for 2015 (597 and 573 rows, against 183 and 188 for depression and alcohol), so the joins multiply rows wherever a country appears more than once. The intended final structure is one row per location with one column per health indicator, obtained by grouping by Location and keeping only the first occurrence for each country. That deduplication step does not appear in the code above, and the summaries that follow are computed on the merged table as the joins produce it: 1,629 rows, with countries repeated.
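A minimal sketch of the row inflation and of the "first occurrence per country" step, using made-up toy tables rather than the WHO files. `distinct()` with `.keep_all = TRUE` is one way to keep the first row per key; which row counts as "first" depends on the input order, so filtering the source tables to a single category before joining may be a more meaningful choice if the export distinguishes several.

```r
library(dplyr)

# Toy tables (made-up values): `a` has two rows for country X, `b` has one row each
a <- data.frame(Location = c("X", "X", "Y"), BMI = c(20.1, 22.3, 30.5))
b <- data.frame(Location = c("X", "Y"), Alcohol = c(4.4, 0.8))

merged <- inner_join(a, b, by = "Location")
nrow(merged)  # X appears twice because `a` holds two X rows

# Keep the first row per Location, giving one row per country
one_per_country <- merged %>%
  distinct(Location, .keep_all = TRUE)
nrow(one_per_country)
```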
library(stringr)
# BMI values are stored as strings such as "10.2 [8.5-12.1]" (see the preview above).
# Keep the leading point estimate, drop the bracketed interval, and convert to numeric.
# The same extraction is applied to cholesterol and alcohol; depression is converted directly.
health_2015 <- health_2015 %>%
  mutate(
    BMI = as.numeric(str_extract(BMI, "^[0-9]+\\.?[0-9]*")),
    Cholesterol = as.numeric(str_extract(Cholesterol, "^[0-9]+\\.?[0-9]*")),
    Alcohol = as.numeric(str_extract(Alcohol, "^[0-9]+\\.?[0-9]*")),
    Depression = as.numeric(Depression)
  )
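To make the extraction concrete, here is a small standalone sketch applying the same regular expression to value strings in the format shown in the BMI preview. The last element ("No data") is a hypothetical placeholder for a non-numeric entry, included only to show that such entries become NA.

```r
library(stringr)

# Two strings in the previewed format, one bare number, one hypothetical non-numeric entry
raw <- c("10.2 [8.5-12.1]", "11.7 [10.5-13.1]", "4.4", "No data")

# Take the leading number (digits with an optional decimal part) and convert it
estimate <- as.numeric(str_extract(raw, "^[0-9]+\\.?[0-9]*"))
estimate
```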
summary(health_2015)
## Location BMI Cholesterol Depression
## Length:1629 Min. : 5.70 Min. :3.600 Min. :2.900
## Class :character 1st Qu.:32.00 1st Qu.:4.200 1st Qu.:4.000
## Mode :character Median :51.30 Median :4.600 Median :4.400
## Mean :46.88 Mean :4.531 Mean :4.439
## 3rd Qu.:60.80 3rd Qu.:4.800 3rd Qu.:5.000
## Max. :92.50 Max. :5.300 Max. :6.300
## Alcohol
## Min. : 0.000
## 1st Qu.: 1.800
## Median : 5.200
## Mean : 5.589
## 3rd Qu.: 9.000
## Max. :16.900
str(health_2015)
## 'data.frame': 1629 obs. of 5 variables:
## $ Location : chr "Burundi" "Burundi" "Burundi" "Timor-Leste" ...
## $ BMI : num 10.2 10.2 10.2 10.3 10.3 10.3 10.6 10.6 10.6 11.2 ...
## $ Cholesterol: num 3.8 3.9 4.1 4.1 4.3 4.5 3.8 4 4.1 3.8 ...
## $ Depression : num 4.2 4.2 4.2 3 3 3 4.6 4.6 4.6 4.7 ...
## $ Alcohol : num 4.4 4.4 4.4 0.8 0.8 0.8 7.7 7.7 7.7 2.5 ...
head(health_2015)
## Location BMI Cholesterol Depression Alcohol
## 1 Burundi 10.2 3.8 4.2 4.4
## 2 Burundi 10.2 3.9 4.2 4.4
## 3 Burundi 10.2 4.1 4.2 4.4
## 4 Timor-Leste 10.3 4.1 3.0 0.8
## 5 Timor-Leste 10.3 4.3 3.0 0.8
## 6 Timor-Leste 10.3 4.5 3.0 0.8
Let’s check for NA values in the health indicators.
colSums(is.na(health_2015[, c("BMI","Cholesterol","Alcohol","Depression")]))
## BMI Cholesterol Alcohol Depression
## 0 0 0 0
library(ggplot2)
library(tidyr)
health_long <- health_2015 %>%
pivot_longer(
cols = c(BMI, Cholesterol, Depression, Alcohol),
names_to = "Indicator",
values_to = "Value"
)
ggplot(health_long, aes(x = Indicator, y = Value)) +
geom_boxplot(fill = "steelblue", alpha = 0.7) +
theme_minimal() +
labs(
title = "Distribution of Health Indicators Across Countries (2015)",
x = "Health Indicator",
y = "Value"
)
This figure provides a comparative overview of the distribution of key
health indicators across countries in 2015. The boxplot highlights
substantial variation in BMI and alcohol consumption, while cholesterol
and depression levels appear more tightly clustered across countries.
The presence of different scales and spreads across indicators suggests
heterogeneity in population health profiles. Visualizing these
distributions helps justify the transformation of continuous indicators
into relative high/low categories. In particular, the median values
shown in the boxplot form a natural and interpretable threshold for
subsequent binarization.
To facilitate pattern discovery and association analysis, the continuous health indicators were transformed into binary variables. For each indicator (BMI, cholesterol, depression, and alcohol consumption), a new variable was created to indicate whether a country’s value was above the median of that indicator across all countries. Specifically, values greater than the median were coded as “high,” while values below or equal to the median were coded as “not high.” Only the Location identifier and the newly created binary variables were retained. This binarized representation provides a simplified and standardized view of relative health risk levels across countries and supports subsequent exploratory and rule-based analysis.
# Binarize
health_binary <- health_2015 %>%
mutate(
High_BMI = BMI > median(BMI, na.rm = TRUE),
High_Cholesterol = Cholesterol > median(Cholesterol, na.rm = TRUE),
High_Depression = Depression > median(Depression, na.rm = TRUE),
High_Alcohol = Alcohol > median(Alcohol, na.rm = TRUE)
) %>%
select(Location, starts_with("High_"))
head(health_binary)
## Location High_BMI High_Cholesterol High_Depression High_Alcohol
## 1 Burundi FALSE FALSE FALSE FALSE
## 2 Burundi FALSE FALSE FALSE FALSE
## 3 Burundi FALSE FALSE FALSE FALSE
## 4 Timor-Leste FALSE FALSE FALSE FALSE
## 5 Timor-Leste FALSE FALSE FALSE FALSE
## 6 Timor-Leste FALSE FALSE FALSE FALSE
binary_long <- health_binary %>%
select(-Location) %>%
pivot_longer(
cols = everything(),
names_to = "Indicator",
values_to = "High"
)
ggplot(binary_long, aes(x = Indicator, fill = High)) +
geom_bar(position = "fill") +
scale_y_continuous(labels = scales::percent) +
theme_minimal() +
labs(
title = "Proportion of Countries Above Median by Indicator",
x = "Health Indicator",
y = "Proportion of Countries",
fill = "Above Median"
)
This figure illustrates the proportion of countries classified as above the median for each health indicator after binarization. As expected from a median-based threshold, the distribution is approximately balanced for alcohol consumption, BMI, and depression, with close to half of the observations falling into the high category. Cholesterol shows a smaller share above the median (about 43%, or 705 of 1,629 rows in the transaction summary below). Skewness alone cannot unbalance a median split; the shortfall most likely reflects ties, since cholesterol is reported to one decimal place over a narrow range and values equal to the median are coded as “not high.” No indicator is extremely imbalanced, which suits association rule mining because no single item dominates the transactions. Balanced marginal frequencies also make lift easier to interpret, although they do not by themselves guarantee that the mined rules reflect meaningful co-occurrence.
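One way a median split can leave fewer than half of the observations in the "high" group is ties at the median, because the comparison used here is strict (`>`). The sketch below uses made-up values rounded to one decimal place; it illustrates the mechanism and is not a reconstruction of the cholesterol data.

```r
# Made-up values with several ties at the median
x <- c(4.2, 4.4, 4.6, 4.6, 4.6, 4.6, 4.8, 5.0)

med  <- median(x)   # middle of the sorted values
high <- x > med     # strict comparison: values equal to the median are FALSE
mean(high)          # share coded as "high"
```

Only the two values above 4.6 are flagged, so the "high" share is well below one half even though the threshold is the median.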
library(arules)
trans_data <- health_binary %>%
select(-Location)
transactions <- as(trans_data, "transactions")
summary(transactions)
## transactions as itemMatrix in sparse format with
## 1629 rows (elements/itemsets/transactions) and
## 4 columns (items) and a density of 0.4774401
##
## most frequent items:
## High_BMI High_Alcohol High_Depression High_Cholesterol
## 813 801 792 705
## (Other)
## 0
##
## element (itemset/transaction) length distribution:
## sizes
## 0 1 2 3 4
## 369 344 283 331 302
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 1.00 2.00 1.91 3.00 4.00
##
## includes extended item information - examples:
## labels variables levels
## 1 High_BMI High_BMI TRUE
## 2 High_Cholesterol High_Cholesterol TRUE
## 3 High_Depression High_Depression TRUE
##
## includes extended transaction information - examples:
## transactionID
## 1 1
## 2 2
## 3 3
To enable association rule mining, the binarized health indicator data were converted into a transaction format. The Location identifier was first removed, leaving only the binary variables representing whether each health indicator was above the median. This binary matrix was then coerced into a transactions object using the arules package. In this representation, each row of the merged table is a single transaction (1,629 in total, so a country that occupies several rows contributes several transactions), and each “high” health indicator corresponds to an item present in that transaction. The summary of the resulting transactions object reports four items and a density of 0.477, or 1.91 items per transaction on average, with 369 transactions containing no item at all.
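The coercion described above can be tried on a toy logical table. The flags below are made up; `size()` and `itemFrequency()` are arules functions for the number of items per transaction and the relative frequency of each item.

```r
library(arules)

# Toy logical table (made-up flags): one row per transaction, one column per item
flags <- data.frame(
  High_BMI     = c(TRUE, FALSE, TRUE),
  High_Alcohol = c(TRUE, FALSE, FALSE)
)

# TRUE cells become items; FALSE cells are simply absent from the transaction
tr <- as(flags, "transactions")
size(tr)           # items per transaction
itemFrequency(tr)  # relative frequency of each item
```

A row with no TRUE value becomes an empty transaction, which is why the real data contain transactions of size zero.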
library(arules)
itemFrequencyPlot(
transactions,
topN = 10,
type = "relative",
col = "steelblue",
main = "Relative Frequency of High-Risk Health Indicators"
)
The stacked bar chart produced by the code below presents the composition of high-risk health indicators for the 15 countries with the most high-risk flags in 2015. Each bar represents a country, while the colored segments indicate individual health risks (high BMI, high cholesterol, high depression prevalence, and high alcohol consumption) classified relative to the median. The height of each bar counts that country's high-risk flags, and the color composition reveals which specific indicators contribute to its overall risk profile. With one row per country the height would range from one to four, since countries with no high-risk flag are filtered out; because the merged table can hold several rows per country, heights here can exceed four. This visualization makes co-occurring health risks explicit at the country level and provides an intuitive link between descriptive analysis and subsequent association rule mining.
country_long <- health_binary %>%
select(Location, High_BMI, High_Cholesterol, High_Depression, High_Alcohol) %>%
pivot_longer(
cols = starts_with("High_"),
names_to = "Indicator",
values_to = "High"
) %>%
filter(High == TRUE)
top_countries <- country_long %>%
count(Location) %>%
arrange(desc(n)) %>%
slice_head(n = 15) %>%
pull(Location)
country_long_top <- country_long %>%
filter(Location %in% top_countries)
ggplot(country_long_top,
aes(x = reorder(Location, Location, length),
y = 1,
fill = Indicator)) +
geom_col(width = 0.7) +
coord_flip() +
scale_fill_brewer(palette = "Set2",
labels = c("Alcohol",
"BMI",
"Cholesterol",
"Depression")) +
theme_minimal(base_size = 11) +
labs(
title = "Country-Level Composition of High-Risk Health Indicators (2015)",
x = "Country",
y = "Number of High-Risk Indicators",
fill = "Health Indicator"
)
inspect(head(transactions))
## items transactionID
## [1] {} 1
## [2] {} 2
## [3] {} 3
## [4] {} 4
## [5] {} 5
## [6] {} 6
# Apriori Algorithm
rules_apriori <- apriori(
transactions,
parameter = list(supp = 0.15, conf = 0.6)
)
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.6 0.1 1 none FALSE TRUE 5 0.15 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 244
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[4 item(s), 1629 transaction(s)] done [0.00s].
## sorting and recoding items ... [4 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [26 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
inspect(head(sort(rules_apriori, by = "lift")))
## lhs rhs support confidence coverage lift count
## [1] {High_BMI,
## High_Cholesterol,
## High_Alcohol} => {High_Depression} 0.1853898 0.8856305 0.2093309 1.821581 302
## [2] {High_BMI,
## High_Depression,
## High_Alcohol} => {High_Cholesterol} 0.1853898 0.7803618 0.2375691 1.803134 302
## [3] {High_Depression,
## High_Alcohol} => {High_Cholesterol} 0.2578269 0.7650273 0.3370166 1.767701 420
## [4] {High_BMI,
## High_Alcohol} => {High_Depression} 0.2375691 0.8431373 0.2817680 1.734180 387
## [5] {High_BMI,
## High_Alcohol} => {High_Cholesterol} 0.2093309 0.7429194 0.2817680 1.716618 341
## [6] {High_Cholesterol,
## High_Alcohol} => {High_Depression} 0.2578269 0.8235294 0.3130755 1.693850 420
Interpretation
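Apriori identifies directional association rules. With a minimum support of 0.15 and a minimum confidence of 0.6 it returns 26 rules. The strongest by lift is {High_BMI, High_Cholesterol, High_Alcohol} => {High_Depression}, with confidence 0.89 and lift 1.82. The simpler rule {High_BMI, High_Alcohol} => {High_Depression} has confidence 0.84 and lift 1.73: observations with high BMI and high alcohol consumption show elevated depression prevalence about 1.7 times as often as observations overall. These rules describe co-occurrence, not causation.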
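The three rule metrics can be reproduced by hand from counts that appear in this report's outputs. The sketch below does so for {High_BMI, High_Alcohol} => {High_Depression}; the variable names are illustrative, and the counts are read from the Apriori and ECLAT listings.

```r
# Counts read from the outputs in this report
n      <- 1629  # transactions
n_lhs  <- 459   # {High_BMI, High_Alcohol}
n_rhs  <- 792   # {High_Depression}
n_both <- 387   # {High_BMI, High_Depression, High_Alcohol}

support    <- n_both / n                # share of transactions containing the whole rule
confidence <- n_both / n_lhs            # share of LHS transactions that also contain the RHS
lift       <- confidence / (n_rhs / n)  # confidence relative to the RHS base rate

round(c(support = support, confidence = confidence, lift = lift), 4)
```

The rounded values agree with the support, confidence, and lift that `inspect()` reports for this rule.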
freq_eclat <- eclat(
transactions,
parameter = list(supp = 0.15)
)
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.15 1 10 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 244
##
## create itemset ...
## set transactions ...[4 item(s), 1629 transaction(s)] done [0.00s].
## sorting and recoding items ... [4 item(s)] done [0.00s].
## creating bit matrix ... [4 row(s), 1629 column(s)] done [0.00s].
## writing ... [15 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
inspect(head(freq_eclat))
## items support count
## [1] {High_BMI,
## High_Cholesterol,
## High_Depression,
## High_Alcohol} 0.1853898 302
## [2] {High_Cholesterol,
## High_Depression,
## High_Alcohol} 0.2578269 420
## [3] {High_BMI,
## High_Cholesterol,
## High_Alcohol} 0.2093309 341
## [4] {High_BMI,
## High_Cholesterol,
## High_Depression} 0.2400246 391
## [5] {High_Cholesterol,
## High_Depression} 0.3204420 522
## [6] {High_BMI,
## High_Cholesterol} 0.3026397 493
class(freq_eclat)
## [1] "itemsets"
## attr(,"package")
## [1] "arules"
length(freq_eclat)
## [1] 15
inspect(freq_eclat)
## items support count
## [1] {High_BMI,
## High_Cholesterol,
## High_Depression,
## High_Alcohol} 0.1853898 302
## [2] {High_Cholesterol,
## High_Depression,
## High_Alcohol} 0.2578269 420
## [3] {High_BMI,
## High_Cholesterol,
## High_Alcohol} 0.2093309 341
## [4] {High_BMI,
## High_Cholesterol,
## High_Depression} 0.2400246 391
## [5] {High_Cholesterol,
## High_Depression} 0.3204420 522
## [6] {High_BMI,
## High_Cholesterol} 0.3026397 493
## [7] {High_Cholesterol,
## High_Alcohol} 0.3130755 510
## [8] {High_BMI,
## High_Depression,
## High_Alcohol} 0.2375691 387
## [9] {High_Depression,
## High_Alcohol} 0.3370166 549
## [10] {High_BMI,
## High_Alcohol} 0.2817680 459
## [11] {High_BMI,
## High_Depression} 0.3406998 555
## [12] {High_Depression} 0.4861878 792
## [13] {High_BMI} 0.4990792 813
## [14] {High_Alcohol} 0.4917127 801
## [15] {High_Cholesterol} 0.4327808 705
Interpretation
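ECLAT identifies frequent itemsets representing clusters of co-occurring health risks. At a minimum support of 0.15 it returns 15 itemsets, which is every non-empty combination of the four items: four single items, six pairs, four triples, and the full set. All four indicators are high together in 302 transactions (support 0.185), compared with roughly 0.05 if the four items were independent. The results therefore reveal multi-risk patterns involving metabolic, behavioral, and mental health indicators.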
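ECLAT works on a vertical layout: each item is stored with the IDs of the transactions that contain it, and the support of an itemset is the size of the intersection of those ID lists. The base R sketch below illustrates that idea on made-up data; it is not how arules implements `eclat()` internally, and the helper `itemset_support()` is a hypothetical name.

```r
# Toy vertical layout (made-up data): for each item, the IDs of transactions containing it
tidlists <- list(
  High_BMI     = c(1, 2, 4, 5),
  High_Alcohol = c(2, 3, 4, 5),
  High_Chol    = c(2, 5, 6)
)
n_transactions <- 6

# Support of an itemset = size of the intersection of its items' ID lists / n
itemset_support <- function(items) {
  length(Reduce(intersect, tidlists[items])) / n_transactions
}

itemset_support(c("High_BMI", "High_Alcohol"))               # transactions 2, 4, 5
itemset_support(c("High_BMI", "High_Alcohol", "High_Chol"))  # transactions 2, 5
```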
FP-Growth–Style Pattern Mining
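Due to implementation constraints in the arules package, FP-Growth was not run directly. It was instead represented through a frequent-pattern mining and rule-induction pipeline: rules are induced with ruleInduction() from the frequent itemsets already mined by ECLAT (freq_eclat), using the same confidence threshold of 0.6 as the Apriori run.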
rules_fpgrowth <- ruleInduction(
freq_eclat,
transactions,
confidence = 0.6
)
inspect(head(sort(rules_fpgrowth, by = "lift")))
## lhs rhs support confidence lift itemset
## [1] {High_BMI,
## High_Cholesterol,
## High_Alcohol} => {High_Depression} 0.1853898 0.8856305 1.821581 1
## [2] {High_BMI,
## High_Depression,
## High_Alcohol} => {High_Cholesterol} 0.1853898 0.7803618 1.803134 1
## [3] {High_Depression,
## High_Alcohol} => {High_Cholesterol} 0.2578269 0.7650273 1.767701 2
## [4] {High_BMI,
## High_Alcohol} => {High_Depression} 0.2375691 0.8431373 1.734180 8
## [5] {High_BMI,
## High_Alcohol} => {High_Cholesterol} 0.2093309 0.7429194 1.716618 3
## [6] {High_Cholesterol,
## High_Alcohol} => {High_Depression} 0.2578269 0.8235294 1.693850 2
Interpretation
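The FP-Growth–style approach produces the same top rules as Apriori, with identical support, confidence, and lift, as expected when both are applied to the same transactions with the same thresholds. FP-Growth in general avoids explicit candidate generation, which can improve computational efficiency on large datasets. On this four-item dataset every logged mining step completes in 0.00s, so no efficiency difference is observable here.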
comparison <- data.frame(
Algorithm = c("Apriori", "FP-Growth-style", "ECLAT"),
Output_Size = c(
length(rules_apriori),
length(rules_fpgrowth),
length(freq_eclat)
)
)
comparison
## Algorithm Output_Size
## 1 Apriori 26
## 2 FP-Growth-style 26
## 3 ECLAT 15
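Apriori and the FP-Growth–style approach generated the same number of rules (26), while ECLAT returned 15 frequent itemsets. The ECLAT figure counts itemsets rather than rules, so it is not directly comparable with the other two: a single itemset can give rise to several rules.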
The results demonstrate strong clustering of health risk factors across countries. Alcohol consumption and BMI consistently appear as central components in both rule-based and itemset-based analyses. The agreement between Apriori and FP-Growth–style rules suggests robustness of the identified patterns.
This study shows that association rule mining can effectively uncover complex co-occurrence patterns among health risk factors. While Apriori provides the most interpretable rules, the FP-Growth–style approach is expected to offer better computational efficiency as the number of items and transactions grows. ECLAT complements both by identifying the frequent multi-risk itemsets from which rules are derived. Overall, FP-Growth–style pattern mining offers the best balance between efficiency and rule quality, although with only four items the efficiency gain is not measurable on this dataset.