1. Introduction

Association rule mining is a widely used data mining technique for discovering hidden patterns and relationships among variables in large datasets. In the context of public health, understanding how multiple risk factors co-occur across countries can provide valuable insights for policy design and preventive strategies.

This study applies three association rule mining approaches—Apriori, FP-Growth–style pattern mining, and ECLAT—to identify co-occurring health risk factors across countries using data from the World Health Organization (WHO).

2. Data Description

The dataset was obtained from the WHO Global Health Observatory and includes country-level indicators on:

Body Mass Index (BMI)

Cholesterol levels

Depression prevalence

Alcohol consumption

To ensure temporal consistency, all indicators were restricted to observations from the year 2015. Each country was treated as one transaction.

3. Data Preprocessing

library(dplyr)
library(stringr)
library(arules)
library(ggplot2)
library(tidyr)

bmi  <- read.csv("data/BMI.csv", stringsAsFactors = FALSE)
chol <- read.csv("data/cholestrol.csv", stringsAsFactors = FALSE)
alc  <- read.csv("data/alcohol.csv", stringsAsFactors = FALSE)
dep  <- read.csv("data/depression.csv", stringsAsFactors = FALSE)

Filter for 2015 and select relevant columns

For consistency and comparability across datasets, the analysis was restricted to a single reference year. The year 2015 was selected because it was the most recent year for which data were available for all four indicators: depression, alcohol consumption, cholesterol, and BMI. Restricting the data to a common year ensures that all variables correspond to the same time period and avoids biases that could result from temporal mismatches between datasets.

bmi_2015 <- bmi %>%
  filter(Period == 2015)

chol_2015 <- chol %>%
  filter(Period == 2015)

dep_2015 <- dep %>%
  filter(Period == 2015)

alc_2015 <- alc %>%
  filter(Period == 2015)
nrow(bmi_2015)
## [1] 597
nrow(chol_2015)
## [1] 573
nrow(dep_2015)
## [1] 183
nrow(alc_2015)  
## [1] 188

After filtering the datasets to the common reference year, an initial inspection of the BMI dataset was performed to verify that the relevant variables and values were correctly retained. Only the essential columns required for the analysis were then selected, namely the Location identifier and the corresponding numerical value for each health indicator. The generic column name Value was subsequently renamed to a meaningful and descriptive variable name in each dataset (BMI, Cholesterol, Depression, and Alcohol). This standardization ensured a consistent and interpretable structure across all four datasets and prepared them for a clean and reliable merge.

head(bmi_2015[, c("Location", "Period", "Value")])
##      Location Period            Value
## 1     Burundi   2015  10.2 [8.5-12.1]
## 2 Timor-Leste   2015  10.3 [9.2-11.4]
## 3      Uganda   2015  10.6 [9.5-11.7]
## 4    Ethiopia   2015  11.2 [9.9-12.5]
## 5     Eritrea   2015  11.4 [8.6-14.5]
## 6    Viet Nam   2015 11.7 [10.5-13.1]
bmi_2015 <- bmi_2015 %>%
  select(Location, Value) %>%
  rename(BMI = Value)

chol_2015 <- chol_2015 %>%
  select(Location, Value) %>%
  rename(Cholesterol = Value)

dep_2015 <- dep_2015 %>%
  select(Location, Value) %>%
  rename(Depression = Value)

alc_2015 <- alc_2015 %>%
  select(Location, Value) %>%
  rename(Alcohol = Value)

# Keep only locations that appear in all four datasets
health_2015 <- bmi_2015 %>%
  inner_join(chol_2015, by = "Location") %>%
  inner_join(dep_2015, by = "Location") %>%
  inner_join(alc_2015, by = "Location")

The four health indicator datasets (BMI, cholesterol, depression, and alcohol consumption) were merged into a single dataset using Location as the common key. Inner joins retain only locations present in all four datasets, ensuring complete data across all indicators. The BMI and cholesterol extracts contain several 2015 rows per location (597 and 573 rows, against 183 and 188 for depression and alcohol), so joining on Location alone is many-to-many and yields 1,629 rows rather than one per country. The intended final step, grouping by Location and keeping the first row per country, does not appear in the code shown, and the outputs that follow are computed on the 1,629-row dataset.
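A minimal sketch of how a join on Location multiplies rows when the inputs hold several rows per location, and how reducing each input to one row per location avoids it. The toy data frames and their values are invented for illustration; they are not the WHO extracts.

```r
library(dplyr)

# Toy stand-ins for two of the extracts; values are illustrative, not WHO data.
bmi_toy <- data.frame(
  Location = c("A", "A", "A", "B"),
  BMI      = c(10.2, 9.8, 10.6, 20.1)
)
chol_toy <- data.frame(
  Location    = c("A", "A", "A", "B"),
  Cholesterol = c(3.8, 3.9, 4.1, 4.5)
)

# Joining on Location alone pairs every BMI row with every cholesterol row.
# Recent dplyr versions warn about many-to-many matches, hence suppressWarnings().
joined <- suppressWarnings(inner_join(bmi_toy, chol_toy, by = "Location"))
nrow(joined)  # 3 * 3 rows for "A" plus 1 row for "B"

# Reducing each source to one row per Location first keeps the join one-to-one.
bmi_one  <- distinct(bmi_toy, Location, .keep_all = TRUE)
chol_one <- distinct(chol_toy, Location, .keep_all = TRUE)
one_per_location <- inner_join(bmi_one, chol_one, by = "Location")
nrow(one_per_location)
```

Keeping the first row per location is an arbitrary choice. If the extracts carry a subgroup column, filtering on it would be more principled; no such column is shown in the source, so it is left out here.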

# Extract the leading point estimate from strings such as "10.2 [8.5-12.1]"
# and convert each indicator to numeric. Depression is converted directly.
point_estimate <- "^[0-9]+\\.?[0-9]*"

health_2015 <- health_2015 %>%
  mutate(
    BMI         = as.numeric(str_extract(BMI, point_estimate)),
    Cholesterol = as.numeric(str_extract(Cholesterol, point_estimate)),
    Alcohol     = as.numeric(str_extract(Alcohol, point_estimate)),
    Depression  = as.numeric(Depression)
  )
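
As a quick check of the extraction step above, the sketch below applies the same leading-number pattern to a few sample strings. The first two strings are copied from the BMI preview earlier in this section; the "No data" entry is a hypothetical placeholder added to show what happens when no number is present.

```r
library(stringr)

# Two values from the BMI preview plus a hypothetical non-numeric entry
raw <- c("10.2 [8.5-12.1]", "11.7 [10.5-13.1]", "No data")

# Leading digits, an optional decimal point, then optional digits
point <- as.numeric(str_extract(raw, "^[0-9]+\\.?[0-9]*"))
point
```

Entries without a leading number come back as NA, which is why the NA check on the parsed columns is worth running afterwards.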

summary(health_2015)
##    Location              BMI         Cholesterol      Depression   
##  Length:1629        Min.   : 5.70   Min.   :3.600   Min.   :2.900  
##  Class :character   1st Qu.:32.00   1st Qu.:4.200   1st Qu.:4.000  
##  Mode  :character   Median :51.30   Median :4.600   Median :4.400  
##                     Mean   :46.88   Mean   :4.531   Mean   :4.439  
##                     3rd Qu.:60.80   3rd Qu.:4.800   3rd Qu.:5.000  
##                     Max.   :92.50   Max.   :5.300   Max.   :6.300  
##     Alcohol      
##  Min.   : 0.000  
##  1st Qu.: 1.800  
##  Median : 5.200  
##  Mean   : 5.589  
##  3rd Qu.: 9.000  
##  Max.   :16.900
str(health_2015)
## 'data.frame':    1629 obs. of  5 variables:
##  $ Location   : chr  "Burundi" "Burundi" "Burundi" "Timor-Leste" ...
##  $ BMI        : num  10.2 10.2 10.2 10.3 10.3 10.3 10.6 10.6 10.6 11.2 ...
##  $ Cholesterol: num  3.8 3.9 4.1 4.1 4.3 4.5 3.8 4 4.1 3.8 ...
##  $ Depression : num  4.2 4.2 4.2 3 3 3 4.6 4.6 4.6 4.7 ...
##  $ Alcohol    : num  4.4 4.4 4.4 0.8 0.8 0.8 7.7 7.7 7.7 2.5 ...
head(health_2015)
##      Location  BMI Cholesterol Depression Alcohol
## 1     Burundi 10.2         3.8        4.2     4.4
## 2     Burundi 10.2         3.9        4.2     4.4
## 3     Burundi 10.2         4.1        4.2     4.4
## 4 Timor-Leste 10.3         4.1        3.0     0.8
## 5 Timor-Leste 10.3         4.3        3.0     0.8
## 6 Timor-Leste 10.3         4.5        3.0     0.8

Let’s check for NA values in the health indicators.

colSums(is.na(health_2015[, c("BMI","Cholesterol","Alcohol","Depression")]))
##         BMI Cholesterol     Alcohol  Depression 
##           0           0           0           0

health_long <- health_2015 %>%
  pivot_longer(
    cols = c(BMI, Cholesterol, Depression, Alcohol),
    names_to = "Indicator",
    values_to = "Value"
  )

ggplot(health_long, aes(x = Indicator, y = Value)) +
  geom_boxplot(fill = "steelblue", alpha = 0.7) +
  theme_minimal() +
  labs(
    title = "Distribution of Health Indicators Across Countries (2015)",
    x = "Health Indicator",
    y = "Value"
  )

This figure provides a comparative overview of the distribution of key health indicators across countries in 2015. The boxplot highlights substantial variation in BMI and alcohol consumption, while cholesterol and depression levels appear more tightly clustered across countries. The presence of different scales and spreads across indicators suggests heterogeneity in population health profiles. Visualizing these distributions helps justify the transformation of continuous indicators into relative high/low categories. In particular, the median values shown in the boxplot form a natural and interpretable threshold for subsequent binarization.

Binarization of Health Indicators

To facilitate pattern discovery and association analysis, the continuous health indicators were transformed into binary variables. For each indicator (BMI, cholesterol, depression, and alcohol consumption), a new variable was created to indicate whether a country’s value was above the median of that indicator across all countries. Specifically, values greater than the median were coded as “high,” while values below or equal to the median were coded as “not high.” Only the Location identifier and the newly created binary variables were retained. This binarized representation provides a simplified and standardized view of relative health risk levels across countries and supports subsequent exploratory and rule-based analysis.

# Binarize
health_binary <- health_2015 %>%
  mutate(
    High_BMI         = BMI         > median(BMI, na.rm = TRUE),
    High_Cholesterol = Cholesterol > median(Cholesterol, na.rm = TRUE),
    High_Depression  = Depression  > median(Depression, na.rm = TRUE),
    High_Alcohol     = Alcohol     > median(Alcohol, na.rm = TRUE)
  ) %>%
  select(Location, starts_with("High_"))

head(health_binary)
##      Location High_BMI High_Cholesterol High_Depression High_Alcohol
## 1     Burundi    FALSE            FALSE           FALSE        FALSE
## 2     Burundi    FALSE            FALSE           FALSE        FALSE
## 3     Burundi    FALSE            FALSE           FALSE        FALSE
## 4 Timor-Leste    FALSE            FALSE           FALSE        FALSE
## 5 Timor-Leste    FALSE            FALSE           FALSE        FALSE
## 6 Timor-Leste    FALSE            FALSE           FALSE        FALSE
binary_long <- health_binary %>%
  select(-Location) %>%
  pivot_longer(
    cols = everything(),
    names_to = "Indicator",
    values_to = "High"
  )

ggplot(binary_long, aes(x = Indicator, fill = High)) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  theme_minimal() +
  labs(
    title = "Proportion of Countries Above Median by Indicator",
    x = "Health Indicator",
    y = "Proportion of Countries",
    fill = "Above Median"
  )

Interpretation: Proportion of Countries Above the Median by Indicator

This figure shows the proportion of rows classified as above the median for each health indicator after binarization. As expected from a median-based threshold, the split is close to even for BMI (813 of 1,629), alcohol consumption (801), and depression (792). Cholesterol has a smaller share above the median (705 of 1,629, about 43%). Because the comparison is a strict greater-than, values equal to the median are coded as not high; cholesterol is reported to one decimal place over a narrow range (3.6 to 5.3), so many rows tie at the median of 4.6. The smaller share therefore reflects ties rather than skewness. No indicator is extremely imbalanced, so no single item dominates the transactions, which makes the binarized data a reasonable input for association rule mining.
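A toy illustration of how ties at the median affect a strict greater-than threshold. The vector below is invented for the example; it only mimics an indicator reported to one decimal place over a narrow range.

```r
# Invented values with three observations tied at the median
x <- c(4.2, 4.5, 4.6, 4.6, 4.6, 4.8, 5.0)

med <- median(x)                     # the middle (4th) value, 4.6
share_above       <- mean(x >  med)  # ties are coded as "not high"
share_at_or_above <- mean(x >= med)  # ties are coded as "high"

c(median = med, above = share_above, at_or_above = share_at_or_above)
```

With a strict comparison only two of the seven values count as high, while a greater-than-or-equal comparison would count five. Neither split is exactly even when ties are present, so the choice of operator should be stated alongside the threshold.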

Convert to transactions

trans_data <- health_binary %>%
  select(-Location)

transactions <- as(trans_data, "transactions")

summary(transactions)
## transactions as itemMatrix in sparse format with
##  1629 rows (elements/itemsets/transactions) and
##  4 columns (items) and a density of 0.4774401 
## 
## most frequent items:
##         High_BMI     High_Alcohol  High_Depression High_Cholesterol 
##              813              801              792              705 
##          (Other) 
##                0 
## 
## element (itemset/transaction) length distribution:
## sizes
##   0   1   2   3   4 
## 369 344 283 331 302 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    1.00    2.00    1.91    3.00    4.00 
## 
## includes extended item information - examples:
##             labels        variables levels
## 1         High_BMI         High_BMI   TRUE
## 2 High_Cholesterol High_Cholesterol   TRUE
## 3  High_Depression  High_Depression   TRUE
## 
## includes extended transaction information - examples:
##   transactionID
## 1             1
## 2             2
## 3             3

To enable association rule mining, the binarized data were converted into a transaction format. The Location identifier was first removed, leaving only the four logical variables that flag whether each indicator is above its median. This logical data frame was then coerced into a transactions object using the arules package. In this representation, each row of health_binary becomes one transaction, and each indicator flagged TRUE becomes an item in that transaction; rows with no TRUE flag become empty transactions. The summary above reports 1,629 transactions (so a location can contribute more than one), 4 items, and a density of about 0.48; 369 transactions are empty and 302 contain all four items.
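The coercion described above can be sketched on a three-row toy data frame. The column names mirror two of the indicators, but the values are invented for illustration.

```r
library(arules)

# Invented logical flags: row 2 has no TRUE value at all
toy <- data.frame(
  High_BMI     = c(TRUE, FALSE, FALSE),
  High_Alcohol = c(TRUE, FALSE, TRUE)
)

# Each row becomes a transaction; each TRUE becomes an item
toy_trans <- as(toy, "transactions")

# Number of items per transaction
sizes <- unname(size(toy_trans))

inspect(toy_trans)
```

The second row becomes an empty transaction, which is the same pattern as the `{}` entries that appear when the first rows of the real transactions object are inspected.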

itemFrequencyPlot(
  transactions,
  topN = 10,
  type = "relative",
  col = "steelblue",
  main = "Relative Frequency of High-Risk Health Indicators"
)

Country-Level Composition of High-Risk Health Indicators (2015)

This stacked bar chart presents the composition of high-risk health indicators for the 15 locations with the most above-median flags in 2015. Each bar represents a location, and the colored segments indicate the individual flags (high BMI, high cholesterol, high depression prevalence, and high alcohol consumption), each classified relative to the median. Locations with no above-median indicator are dropped by the filter, so no bar has height zero. With one row per location the bar height is the number of concurrent risk factors (one to four); where a location has several rows in health_binary, the height counts flagged rows instead. The color composition shows which indicators contribute to each location's profile and links the descriptive analysis to the association rule mining that follows.

country_long <- health_binary %>%
  select(Location, High_BMI, High_Cholesterol, High_Depression, High_Alcohol) %>%
  pivot_longer(
    cols = starts_with("High_"),
    names_to = "Indicator",
    values_to = "High"
  ) %>%
  filter(High == TRUE)

# Select the 15 locations with the most above-median flags
top_countries <- country_long %>%
  count(Location) %>%
  arrange(desc(n)) %>%
  slice_head(n = 15) %>%
  pull(Location)

country_long_top <- country_long %>%
  filter(Location %in% top_countries)

ggplot(country_long_top,
       aes(x = reorder(Location, Location, length),
           y = 1,
           fill = Indicator)) +
  geom_col(width = 0.7) +
  coord_flip() +
  scale_fill_brewer(palette = "Set2",
                    labels = c("Alcohol",
                               "BMI",
                               "Cholesterol",
                               "Depression")) +
  theme_minimal(base_size = 11) +
  labs(
    title = "Country-Level Composition of High-Risk Health Indicators (2015)",
    x = "Country",
    y = "Number of High-Risk Indicators",
    fill = "Health Indicator"
  )

Inspecting a few transactions

inspect(head(transactions))
##     items transactionID
## [1] {}    1            
## [2] {}    2            
## [3] {}    3            
## [4] {}    4            
## [5] {}    5            
## [6] {}    6

Apriori Algorithm

rules_apriori <- apriori(
  transactions,
  parameter = list(supp = 0.15, conf = 0.6)
)
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.6    0.1    1 none FALSE            TRUE       5    0.15      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 244 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[4 item(s), 1629 transaction(s)] done [0.00s].
## sorting and recoding items ... [4 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [26 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
inspect(head(sort(rules_apriori, by = "lift")))
##     lhs                    rhs                  support confidence  coverage     lift count
## [1] {High_BMI,                                                                             
##      High_Cholesterol,                                                                     
##      High_Alcohol}      => {High_Depression}  0.1853898  0.8856305 0.2093309 1.821581   302
## [2] {High_BMI,                                                                             
##      High_Depression,                                                                      
##      High_Alcohol}      => {High_Cholesterol} 0.1853898  0.7803618 0.2375691 1.803134   302
## [3] {High_Depression,                                                                      
##      High_Alcohol}      => {High_Cholesterol} 0.2578269  0.7650273 0.3370166 1.767701   420
## [4] {High_BMI,                                                                             
##      High_Alcohol}      => {High_Depression}  0.2375691  0.8431373 0.2817680 1.734180   387
## [5] {High_BMI,                                                                             
##      High_Alcohol}      => {High_Cholesterol} 0.2093309  0.7429194 0.2817680 1.716618   341
## [6] {High_Cholesterol,                                                                     
##      High_Alcohol}      => {High_Depression}  0.2578269  0.8235294 0.3130755 1.693850   420

Interpretation

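Apriori produces directional rules of the form LHS => RHS; with minimum support 0.15 and minimum confidence 0.6 it returns 26 rules. The six rules with the highest lift all have High_Alcohol in the antecedent and either High_Depression or High_Cholesterol as the consequent. For example, {High_BMI, High_Alcohol} => {High_Depression} has confidence 0.84 and lift 1.73: among transactions with high BMI and high alcohol consumption, 84% also have high depression prevalence, about 1.7 times the overall rate of 0.49. Adding High_Cholesterol to the antecedent raises confidence to 0.89 and lift to 1.82.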
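The quality measures of the first rule in the Apriori output can be reproduced by hand from the counts reported in this section. This is a worked check in base R, not part of the original analysis; the variable names are chosen here for illustration.

```r
# Counts taken from the outputs in this section
n      <- 1629  # total transactions
n_lhs  <- 341   # {High_BMI, High_Cholesterol, High_Alcohol}
n_both <- 302   # the same three items plus High_Depression
n_rhs  <- 792   # High_Depression on its own

support    <- n_both / n                 # share of transactions matching the whole rule
coverage   <- n_lhs / n                  # share of transactions matching the antecedent
confidence <- n_both / n_lhs             # P(RHS | LHS)
lift       <- confidence / (n_rhs / n)   # confidence relative to the base rate of the RHS

round(c(support = support, coverage = coverage,
        confidence = confidence, lift = lift), 4)
```

The values agree with the first row of the Apriori table (support 0.1854, coverage 0.2093, confidence 0.8856, lift 1.8216). A lift above 1 means the consequent is more frequent among transactions matching the antecedent than in the data overall.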

ECLAT

freq_eclat <- eclat(
  transactions,
  parameter = list(supp = 0.15)
)
## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target  ext
##     FALSE    0.15      1     10 frequent itemsets TRUE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 244 
## 
## create itemset ... 
## set transactions ...[4 item(s), 1629 transaction(s)] done [0.00s].
## sorting and recoding items ... [4 item(s)] done [0.00s].
## creating bit matrix ... [4 row(s), 1629 column(s)] done [0.00s].
## writing  ... [15 set(s)] done [0.00s].
## Creating S4 object  ... done [0.00s].
inspect(head(freq_eclat))
##     items                 support count
## [1] {High_BMI,                         
##      High_Cholesterol,                 
##      High_Depression,                  
##      High_Alcohol}      0.1853898   302
## [2] {High_Cholesterol,                 
##      High_Depression,                  
##      High_Alcohol}      0.2578269   420
## [3] {High_BMI,                         
##      High_Cholesterol,                 
##      High_Alcohol}      0.2093309   341
## [4] {High_BMI,                         
##      High_Cholesterol,                 
##      High_Depression}   0.2400246   391
## [5] {High_Cholesterol,                 
##      High_Depression}   0.3204420   522
## [6] {High_BMI,                         
##      High_Cholesterol}  0.3026397   493
class(freq_eclat)
## [1] "itemsets"
## attr(,"package")
## [1] "arules"
length(freq_eclat)
## [1] 15
inspect(freq_eclat)
##      items                 support count
## [1]  {High_BMI,                         
##       High_Cholesterol,                 
##       High_Depression,                  
##       High_Alcohol}      0.1853898   302
## [2]  {High_Cholesterol,                 
##       High_Depression,                  
##       High_Alcohol}      0.2578269   420
## [3]  {High_BMI,                         
##       High_Cholesterol,                 
##       High_Alcohol}      0.2093309   341
## [4]  {High_BMI,                         
##       High_Cholesterol,                 
##       High_Depression}   0.2400246   391
## [5]  {High_Cholesterol,                 
##       High_Depression}   0.3204420   522
## [6]  {High_BMI,                         
##       High_Cholesterol}  0.3026397   493
## [7]  {High_Cholesterol,                 
##       High_Alcohol}      0.3130755   510
## [8]  {High_BMI,                         
##       High_Depression,                  
##       High_Alcohol}      0.2375691   387
## [9]  {High_Depression,                  
##       High_Alcohol}      0.3370166   549
## [10] {High_BMI,                         
##       High_Alcohol}      0.2817680   459
## [11] {High_BMI,                         
##       High_Depression}   0.3406998   555
## [12] {High_Depression}   0.4861878   792
## [13] {High_BMI}          0.4990792   813
## [14] {High_Alcohol}      0.4917127   801
## [15] {High_Cholesterol}  0.4327808   705

Interpretation

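ECLAT returns frequent itemsets rather than rules: sets of indicators that are flagged high together in at least 15% of transactions. All 15 non-empty combinations of the four items (2^4 - 1) meet this threshold. The largest itemset, with all four indicators high, has support 0.185 (302 of 1,629 transactions), so the multi-risk patterns span metabolic, behavioral, and mental health indicators.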

FP-Growth–Style Pattern Mining

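Because the arules functions used here (apriori() and eclat()) do not include FP-Growth, this step approximates it with a two-stage pipeline: the frequent itemsets already mined by ECLAT are passed to ruleInduction() with a minimum confidence of 0.6. No FP-tree is built; the rules are derived from the same frequent itemsets that Apriori enumerates.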

rules_fpgrowth <- ruleInduction(
  freq_eclat,
  transactions,
  confidence = 0.6
)

inspect(head(sort(rules_fpgrowth, by = "lift")))
##     lhs                    rhs                  support confidence     lift itemset
## [1] {High_BMI,                                                                     
##      High_Cholesterol,                                                             
##      High_Alcohol}      => {High_Depression}  0.1853898  0.8856305 1.821581       1
## [2] {High_BMI,                                                                     
##      High_Depression,                                                              
##      High_Alcohol}      => {High_Cholesterol} 0.1853898  0.7803618 1.803134       1
## [3] {High_Depression,                                                              
##      High_Alcohol}      => {High_Cholesterol} 0.2578269  0.7650273 1.767701       2
## [4] {High_BMI,                                                                     
##      High_Alcohol}      => {High_Depression}  0.2375691  0.8431373 1.734180       8
## [5] {High_BMI,                                                                     
##      High_Alcohol}      => {High_Cholesterol} 0.2093309  0.7429194 1.716618       3
## [6] {High_Cholesterol,                                                             
##      High_Alcohol}      => {High_Depression}  0.2578269  0.8235294 1.693850       2

Interpretation

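The FP-Growth–style pipeline reproduces the Apriori rules exactly: the six highest-lift rules have the same support, confidence, and lift. This is expected, because both approaches derive rules from the same frequent itemsets at the same thresholds. A true FP-Growth implementation avoids explicit candidate generation, which typically pays off on large, sparse datasets; here, with four items, every mining step is logged at 0.00 s, so no efficiency difference can be observed.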

Output Size Comparison

# Count the rules or itemsets returned by each approach
comparison <- data.frame(
  Algorithm = c("Apriori", "FP-Growth-style", "ECLAT"),
  Output_Size = c(
    length(rules_apriori),
    length(rules_fpgrowth),
    length(freq_eclat)
  )
)

comparison
##         Algorithm Output_Size
## 1         Apriori          26
## 2 FP-Growth-style          26
## 3           ECLAT          15

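Apriori and the FP-Growth–style pipeline each produced 26 rules. ECLAT returned 15 frequent itemsets, which are not rules, so its count is not directly comparable: the 15 itemsets are the raw material from which the 26 rules are induced.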

Discussion

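The results show clustering of above-median health risk factors across the 1,629 transactions. High alcohol consumption appears in the antecedent of all six highest-lift rules, and high BMI in four of them, so both are central to the rule-based and itemset-based results. The agreement between the Apriori and FP-Growth–style rules is expected by construction, since both are derived from the same frequent itemsets; it confirms that the two pipelines are consistent rather than providing independent evidence. Because the merged data contain repeated rows per location, support counts should be read as counts of rows, not of countries.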

Conclusion

This study shows that association rule mining can uncover co-occurrence patterns among health risk factors. Apriori yields directly interpretable rules, and ECLAT complements it by listing the frequent multi-risk itemsets. The FP-Growth–style pipeline returns the same rules as Apriori; its expected advantage is computational efficiency on larger item sets, which this four-item dataset is too small to demonstrate. On that basis, FP-Growth–style pattern mining is a reasonable default when efficiency matters, while all three approaches are interchangeable in rule quality for this dataset.