Association rule mining is an unsupervised learning technique originally developed for transactional data, most notably in retail settings where it is used to identify patterns of co-occurrence among purchased items. In such contexts, the method's underlying assumptions (discrete items, repeated transactions, and loosely independent choices) are reasonably well satisfied.
Applying association rule mining to survey-based public health data, however, represents a methodological stretch. Survey responses do not naturally form transactions, and health behaviors are not items selected independently. Despite this mismatch, association rules can still serve a useful exploratory role when their limitations are explicitly acknowledged.
In this project, association rule mining is applied to data from the Behavioral Risk Factor Surveillance System (BRFSS), a large-scale cross-sectional health survey conducted in the United States. Each respondent is treated as a single “transaction” composed of co-occurring health behaviors, conditions, and self-reported outcomes. This framing does not imply sequential behavior, decision-making processes, or causal relationships. Instead, it allows for the identification of frequently co-existing attributes within the population.
The primary objective of this analysis is not prediction or causal inference, but pattern discovery. Specifically, the project investigates whether association rule mining can uncover interpretable and meaningful combinations of health-related variables, and whether these patterns align with known public health relationships or reveal less obvious behavioral clustering.
At the same time, the project adopts a deliberately critical perspective. Survey data introduce challenges such as self-report bias, confounding variables, and the absence of temporal ordering, all of which limit the interpretability of association rules. Rather than ignoring these issues, the analysis treats them as part of the evaluation: the results are assessed not only by their statistical measures (support, confidence, lift), but also by their plausibility, redundancy, and analytical value.
By positioning association rule mining as an exploratory and descriptive tool rather than an inferential one, this project aims to demonstrate both the usefulness and the boundaries of applying unsupervised learning methods outside their original domain.
Data source: https://www.cdc.gov/brfss/annual_data/annual_2024.html
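The BRFSS is a large-scale, cross-sectional health survey conducted annually in the United States. It contains self-reported information on health status, risk behaviors, and preventive practices. While the dataset is rich in scope, it presents several analytical challenges, notably self-reported responses, a cross-sectional design without temporal ordering, and a sampling design that relies on survey weights.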
library(haven)  # provides read_xpt() for SAS transport files
brfss_raw <- read_xpt("../data/LLCP2024.XPT")
dim(brfss_raw)
## [1] 457670 301
The BRFSS dataset includes a broad set of self-reported health behaviors and conditions, which required targeted preprocessing to make the data suitable for association rule mining. A subset of variables was selected based on relevance and data quality. Variables with high missingness or unclear categories were excluded to reduce noise. This selection reflects a deliberate trade-off between analytical focus and completeness.
Since association rule mining requires binary inputs, categorical and ordinal variables were transformed into binary indicators (e.g., smoker vs. non-smoker). Ordinal variables were discretized using health-relevant thresholds rather than purely data-driven cutoffs. While necessary, this process reduces information granularity and limits interpretation to co-occurrence patterns. Each respondent was treated as a single transaction composed of multiple binary items. This representation ignores behavioral intensity and temporal structure, and therefore captures descriptive associations rather than causal relationships.
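Missing values were not imputed. Where a recoding comparison evaluates to NA (for example, a missing BMI), the respondent simply contributes no item for that variable; no respondents were dropped, so all 457,670 records remain in the transaction set. Note that the recoding shown below does not separate out “don't know” or “refused” responses (codes such as 7 and 9 in the BRFSS codebook); these fall into the “No” category, which matters when reading rules built on “=No” items. Survey weights were not incorporated, as standard Apriori implementations do not support weighted transactions. As a result, the discovered rules reflect patterns in the processed sample rather than population-level estimates.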
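library(dplyr)  # %>%, transmute(), if_else()

# Recode each selected BRFSS variable into a "Yes"/"No" indicator
brfss_selected <- brfss_raw %>%
  transmute(
    Smoking = if_else(`_SMOKER3` %in% c(1, 2), "Yes", "No"),  # current smoker (every day or some days)
    Inactive = if_else(EXERANY2 == 2, "Yes", "No"),  # no leisure-time physical activity
    Obese = if_else(`_BMI5` >= 3000, "Yes", "No"),  # BMI is stored with two implied decimals (3000 = 30.00)
    Diabetes = if_else(DIABETE4 %in% c(1, 2), "Yes", "No"),  # code 2 = told only during pregnancy
    HeartDisease = if_else(CVDINFR4 == 1 | CVDCRHD4 == 1, "Yes", "No"),  # heart attack or coronary heart disease
    Asthma = if_else(ASTHMA3 == 1, "Yes", "No"),  # ever told they had asthma
    PoorHealth = if_else(GENHLTH %in% c(4, 5), "Yes", "No")  # general health rated fair or poor
  )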
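The recoding above sends every value outside the “Yes” set to “No” whenever `%in%` is used. As a point of comparison only, the sketch below shows a stricter recoding in base R that keeps unlisted codes as NA. The helper `recode_binary()` is hypothetical and not part of the original pipeline, and the code meanings in the toy vector (1 = yes, 2 = no, 7 = don't know, 9 = refused) are assumptions based on the usual BRFSS conventions.

```r
# Hypothetical helper (not part of the original analysis): map explicitly listed
# codes to "Yes"/"No" and leave everything else, including 7, 9, and NA, as NA.
recode_binary <- function(x, yes, no) {
  out <- rep(NA_character_, length(x))
  out[x %in% yes] <- "Yes"
  out[x %in% no] <- "No"
  out
}

# Toy vector using the assumed EXERANY2 coding:
# 1 = exercised, 2 = did not exercise, 7 = don't know, 9 = refused
exerany2 <- c(1, 2, 7, 9, NA, 2)

# "Inactive" is "Yes" only for code 2 and "No" only for code 1
inactive <- recode_binary(exerany2, yes = 2, no = 1)
print(inactive)
```

With this variant, respondents with uncategorizable answers would contribute neither `Inactive=Yes` nor `Inactive=No` once the data are converted to transactions, at the cost of slightly lower support for every rule involving that variable.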
library(arules)  # transactions class, apriori(), inspect()
transactions <- as(brfss_selected, "transactions")
transactions
## transactions in sparse format with
## 457670 transactions (rows) and
## 14 items (columns)
itemFrequencyPlot(
  transactions,
  topN = 10,
  type = "relative",
  col = "steelblue",
  main = "Top 10 Most Frequent Items"
)
Relative frequency of items in the BRFSS transaction dataset
This project applies association rule mining as an unsupervised learning technique to explore patterns of co-occurrence within public health survey data. Unlike supervised approaches, the objective is not to predict an outcome variable, but to examine how selected health-related attributes tend to appear together within individual survey responses.
Conceptual Framing
Association rule mining was originally developed for transactional data, where each transaction consists of a set of discrete items and the goal is to identify frequently co-occurring item combinations. In this analysis, each BRFSS respondent is treated as a single transaction composed of multiple binary health indicators. This framing is intentionally descriptive rather than causal and is used solely to identify patterns of simultaneous presence across variables. Because health behaviors and conditions are influenced by shared underlying factors (e.g., age, socioeconomic status, access to healthcare), the method is applied with an explicitly exploratory intent. The resulting rules are interpreted as indicators of behavioral or health clustering rather than evidence of dependence or directionality.
Association Rule Mining Framework
The analysis employs the Apriori algorithm, a classical and widely used approach to association rule mining. Apriori operates by first identifying frequent itemsets that satisfy a minimum support threshold, and then generating implication rules from these itemsets that satisfy minimum confidence requirements.
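The level-wise search described above can be illustrated with a minimal base R sketch. It is a toy illustration under stated assumptions, not the arules implementation: the five transactions are invented, the item names are borrowed from this report for readability, and `support()` is a helper defined here.

```r
# Toy transactions (hypothetical data, not BRFSS)
tx <- list(
  c("Inactive", "Obese", "PoorHealth"),
  c("Inactive", "Obese"),
  c("Obese", "PoorHealth"),
  c("Inactive", "Obese", "PoorHealth"),
  c("Smoking")
)

# Share of transactions that contain every item in `itemset`
support <- function(itemset) {
  mean(vapply(tx, function(t) all(itemset %in% t), logical(1)))
}

min_support <- 0.5
items <- sort(unique(unlist(tx)))

# Level 1: frequent single items
current <- Filter(function(s) support(s) >= min_support, as.list(items))
frequent <- list()
k <- 1

while (length(current) > 0) {
  frequent <- c(frequent, current)
  k <- k + 1
  freq_items <- sort(unique(unlist(current)))
  if (length(freq_items) < k) break

  # Candidate k-itemsets built from items that are still frequent
  candidates <- combn(freq_items, k, simplify = FALSE)

  # Apriori pruning: keep a candidate only if all of its (k-1)-subsets are frequent
  prev_keys <- vapply(current, paste, character(1), collapse = ",")
  candidates <- Filter(function(cand) {
    subs <- combn(cand, k - 1, simplify = FALSE)
    all(vapply(subs, paste, character(1), collapse = ",") %in% prev_keys)
  }, candidates)

  # Count support only for the surviving candidates
  current <- Filter(function(s) support(s) >= min_support, candidates)
}

print(vapply(frequent, paste, character(1), collapse = ","))
```

In this toy example the three-item candidate is discarded without counting its support, because one of its two-item subsets is already infrequent. That pruning step is what keeps the search tractable on larger item sets.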
Rules take the general form:
X ⇒ Y
where X (the antecedent) and Y (the consequent) are sets of binary health indicators. A rule is interpreted as follows: among respondents who exhibit all attributes in X, a certain proportion also exhibit the attributes in Y.
Evaluation Metrics
Three standard metrics were used to evaluate and filter the generated rules:
- Support: measures the proportion of respondents for whom both X and Y occur together. In this context, support reflects how common a particular combination of health attributes is within the sample.
- Confidence: measures the conditional probability that Y occurs given that X has occurred. While often misinterpreted as a measure of reliability, confidence here is treated strictly as a descriptive conditional frequency.
- Lift: compares the observed co-occurrence of X and Y to what would be expected if they were statistically independent. Lift values greater than one indicate positive association beyond marginal prevalence, making lift particularly useful for filtering out rules driven by highly common variables. Lift was emphasized during rule selection to reduce the dominance of trivial rules involving highly prevalent health indicators.
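For reference, the three metrics can be written compactly as follows. The notation is introduced here for illustration: N is the number of respondents (transactions) and t ranges over individual transactions.

```latex
\[
\operatorname{supp}(X \cup Y) = \frac{\lvert \{\, t : X \cup Y \subseteq t \,\} \rvert}{N},
\qquad
\operatorname{conf}(X \Rightarrow Y) = \frac{\operatorname{supp}(X \cup Y)}{\operatorname{supp}(X)},
\qquad
\operatorname{lift}(X \Rightarrow Y) = \frac{\operatorname{conf}(X \Rightarrow Y)}{\operatorname{supp}(Y)}
\]
```

Under independence, supp(X ∪ Y) equals supp(X) · supp(Y), so lift equals one. A rule with high confidence can still have lift close to one when its consequent is common in the whole sample.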
Threshold Selection
Minimum thresholds for support and confidence were chosen conservatively to balance two competing objectives: retaining potentially meaningful patterns while avoiding an excessive number of weak or redundant rules. Very low thresholds tend to generate a large number of rules that are difficult to interpret and often driven by noise, while overly strict thresholds risk discarding potentially meaningful patterns. Threshold values were adjusted iteratively, with attention paid to both the number of resulting rules and their qualitative interpretability. This iterative process reflects a practical, analyst-driven approach rather than a purely automated optimization procedure.
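One way to make this kind of iterative adjustment more transparent is to tabulate the number of rules over a small grid of thresholds. The sketch below is an illustration only: it uses the `Groceries` example data shipped with arules as a stand-in for the BRFSS transactions, and the grid values are arbitrary assumptions rather than the values explored in this analysis.

```r
library(arules)

# Example transactions shipped with arules; stands in for the BRFSS data
data("Groceries")

# Arbitrary illustrative grid of thresholds (assumption, not from the analysis)
grid <- expand.grid(supp = c(0.005, 0.01, 0.02), conf = c(0.3, 0.5))

# Count the rules produced at each combination of thresholds
grid$n_rules <- mapply(function(s, cf) {
  length(apriori(
    Groceries,
    parameter = list(supp = s, conf = cf),
    control = list(verbose = FALSE)
  ))
}, grid$supp, grid$conf)

print(grid)
```

Raising either threshold can only remove rules, so the counts shrink as the grid moves toward stricter settings. A table like this shows where the rule set becomes small enough to inspect by hand.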
Rule Generation and Filtering
Once frequent itemsets were identified, implication rules were generated and evaluated based on support, confidence, and lift. Rules with low lift values were deprioritized, as they primarily reflected the marginal prevalence of individual variables rather than meaningful co-occurrence. Additional filtering was performed to remove redundant or symmetric rules that did not provide new analytical insight. Emphasis was placed on rules that involved interpretable combinations of behaviors or conditions, rather than mechanically maximizing metric values.
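The exact filtering code is not shown in this report, so the following is only a sketch of how lift-based selection and redundancy pruning could be expressed with arules. It runs on the `Groceries` example data, and the lift cutoff of 1.5 is an arbitrary assumption. `is.redundant()` flags a rule when a more general rule (a subset of its antecedent with the same consequent) has equal or higher confidence.

```r
library(arules)

# Example transactions shipped with arules; stands in for the BRFSS data
data("Groceries")

rules_demo <- apriori(
  Groceries,
  parameter = list(supp = 0.005, conf = 0.3),
  control = list(verbose = FALSE)
)

# Keep rules whose lift clears an illustrative cutoff (assumed value)
strong <- subset(rules_demo, lift > 1.5)

# Drop rules that add nothing over a more general rule with the same consequent
pruned <- strong[!is.redundant(strong)]

inspect(head(sort(pruned, by = "lift"), 5))
```

Pruning of this kind addresses nested antecedents. Symmetric pairs such as A ⇒ B and B ⇒ A would still need to be reviewed separately.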
Interpretation Strategy
Interpretation of the resulting rules was guided by domain plausibility rather than numerical strength alone. Rules were examined in light of existing public health knowledge to assess whether they reflected expected relationships, plausible behavioral clustering, or potentially spurious associations. Most importantly, no attempt was made to infer causality or temporal ordering from the rules. Given the cross-sectional nature of the BRFSS data and the absence of confounder adjustment, all findings are treated as descriptive patterns suitable for hypothesis generation rather than inference.
min_support <- 0.05
min_confidence <- 0.6
These thresholds represent a trade-off between capturing meaningful patterns and limiting the number of generated rules to a manageable set.
# Mine frequent itemsets only; the confidence setting is not used for this target
frequent_items <- apriori(
  transactions,
  parameter = list(
    supp = min_support,
    conf = min_confidence,
    target = "frequent itemsets"
  )
)
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## NA 0.1 1 none FALSE TRUE 5 0.05 1
## maxlen target ext
## 10 frequent itemsets TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 22883
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[14 item(s), 457670 transaction(s)] done [0.42s].
## sorting and recoding items ... [14 item(s)] done [0.05s].
## creating transaction tree ... done [0.50s].
## checking subsets of size 1 2 3 4 5 6 7 done [0.01s].
## sorting transactions ... done [0.47s].
## writing ... [459 set(s)] done [0.00s].
## creating S4 object ... done [0.03s].
summary(frequent_items)
## set of 459 itemsets
##
## most frequent items:
## HeartDisease=No Smoking=No Diabetes=No Asthma=No PoorHealth=No
## 211 205 189 183 172
## (Other)
## 667
##
## element (itemset/transaction) length distribution:sizes
## 1 2 3 4 5 6 7
## 14 70 142 140 73 18 2
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 4.000 3.545 4.000 7.000
##
## summary of quality measures:
## support count
## Min. :0.05000 Min. : 22885
## 1st Qu.:0.06751 1st Qu.: 30896
## Median :0.11412 Median : 52228
## Mean :0.21875 Mean :100116
## 3rd Qu.:0.35860 3rd Qu.:164122
## Max. :0.90749 Max. :415329
##
## includes transaction ID lists: FALSE
##
## mining info:
## data ntransactions support confidence
## transactions 457670 0.05 1
## call
## apriori(data = transactions, parameter = list(supp = min_support, conf = min_confidence, target = "frequent itemsets"))
# Generate association rules (the default target) with the same thresholds
rules <- apriori(
  transactions,
  parameter = list(
    supp = min_support,
    conf = min_confidence
  )
)
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.6 0.1 1 none FALSE TRUE 5 0.05 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 22883
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[14 item(s), 457670 transaction(s)] done [0.42s].
## sorting and recoding items ... [14 item(s)] done [0.07s].
## creating transaction tree ... done [0.58s].
## checking subsets of size 1 2 3 4 5 6 7 done [0.01s].
## writing ... [1171 rule(s)] done [0.00s].
## creating S4 object ... done [0.03s].
rules_sorted <- sort(rules, by = "lift", decreasing = TRUE)
inspect(head(rules_sorted, 10))
## lhs rhs support confidence coverage lift count
## [1] {Smoking=Yes,
## Diabetes=No,
## Asthma=No} => {Obese=No} 0.05007101 0.7017608 0.07135054 1.167938 22916
## [2] {Smoking=No,
## Inactive=No,
## Obese=No,
## Diabetes=No,
## HeartDisease=No,
## Asthma=No} => {PoorHealth=No} 0.30485066 0.9286480 0.32827365 1.156354 139521
## [3] {Smoking=No,
## Inactive=No,
## Obese=No,
## Diabetes=No,
## HeartDisease=No} => {PoorHealth=No} 0.34652697 0.9224125 0.37567461 1.148589 158595
## [4] {Inactive=No,
## Obese=No,
## Diabetes=No,
## HeartDisease=No,
## Asthma=No} => {PoorHealth=No} 0.33154456 0.9194236 0.36060043 1.144868 151738
## [5] {Smoking=No,
## Inactive=No,
## Obese=No,
## Diabetes=No,
## Asthma=No} => {PoorHealth=No} 0.32069832 0.9185833 0.34912273 1.143821 146774
## [6] {Smoking=No,
## Inactive=No,
## Obese=No,
## HeartDisease=No,
## Asthma=No} => {PoorHealth=No} 0.32727730 0.9177553 0.35660629 1.142790 149785
## [7] {Smoking=Yes,
## Diabetes=No} => {Obese=No} 0.05945550 0.6866090 0.08659296 1.142721 27211
## [8] {Smoking=No,
## Obese=No,
## Diabetes=No,
## HeartDisease=No,
## PoorHealth=No} => {Inactive=No} 0.34652697 0.8748331 0.39610636 1.139487 158595
## [9] {Smoking=No,
## Obese=No,
## Diabetes=No,
## HeartDisease=No,
## Asthma=No,
## PoorHealth=No} => {Inactive=No} 0.30485066 0.8739344 0.34882557 1.138316 139521
## [10] {Smoking=Yes,
## Diabetes=No,
## HeartDisease=No} => {Obese=No} 0.05325453 0.6837897 0.07788144 1.138029 24373
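The columns in this table are linked by simple identities, which can be checked by hand. The sketch below copies the values of rules [1], [7], and [10] from the output above (all three have the consequent Obese=No) and recomputes confidence, support, and the prevalence of the consequent implied by the lift. The data frame `top_rules` is built here for illustration only.

```r
# Values copied from rows [1], [7], and [10] of the printed rule table
top_rules <- data.frame(
  support    = c(0.05007101, 0.05945550, 0.05325453),
  confidence = c(0.7017608, 0.6866090, 0.6837897),
  coverage   = c(0.07135054, 0.08659296, 0.07788144),
  lift       = c(1.167938, 1.142721, 1.138029),
  count      = c(22916, 27211, 24373)
)
n <- 457670  # number of transactions reported by arules

# confidence = support / coverage, where coverage is the support of the antecedent
top_rules$confidence_check <- top_rules$support / top_rules$coverage

# support = count / n
top_rules$support_check <- top_rules$count / n

# lift = confidence / support(consequent), so this recovers the share of
# respondents carrying the item Obese=No
top_rules$implied_rhs_support <- top_rules$confidence / top_rules$lift

print(round(top_rules[, c("confidence_check", "support_check", "implied_rhs_support")], 4))
```

All three rules imply the same prevalence for Obese=No, as they should, since the marginal support of a consequent does not depend on the antecedent. This also shows how much of a rule's confidence is already explained by the base rate of its consequent.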
rule_lengths <- size(rules_sorted)
library(ggplot2)
ggplot(data.frame(length = rule_lengths), aes(x = length)) +
  geom_bar(fill = "darkorange") +
  labs(
    x = "Number of items in rule",
    y = "Number of rules"
  ) +
  theme_minimal()
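Because only seven variables enter the analysis, no rule can contain more than seven items, and the frequent itemsets summarized above have a median size of four. Rules of this length remain readable, which supports interpretability and practical usefulness.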
library(arulesViz)  # plot() methods for rules
plot(rules_sorted[1:20], method = "graph", engine = "htmlwidget")
The graph visualization highlights the strongest associations between items. Nodes represent items, while directed edges represent association rules.
plot(rules_sorted, measure = c("support", "confidence"), shading = "lift")
This scatter plot illustrates the trade-off between support and confidence, with lift used as a color scale to emphasize the most informative rules.
plot(
  rules_sorted[1:20],
  method = "matrix",
  measure = "lift",
  control = list(reorder = "measure")
)
## Itemsets in Antecedent (LHS)
## [1] "{Smoking=Yes,Diabetes=No,Asthma=No}"
## [2] "{Smoking=No,Inactive=No,Obese=No,Diabetes=No,HeartDisease=No,Asthma=No}"
## [3] "{Smoking=No,Inactive=No,Obese=No,Diabetes=No,HeartDisease=No}"
## [4] "{Inactive=No,Obese=No,Diabetes=No,HeartDisease=No,Asthma=No}"
## [5] "{Smoking=No,Inactive=No,Obese=No,Diabetes=No,Asthma=No}"
## [6] "{Smoking=No,Inactive=No,Obese=No,HeartDisease=No,Asthma=No}"
## [7] "{Smoking=Yes,Diabetes=No}"
## [8] "{Smoking=No,Obese=No,Diabetes=No,HeartDisease=No,PoorHealth=No}"
## [9] "{Smoking=No,Obese=No,Diabetes=No,HeartDisease=No,Asthma=No,PoorHealth=No}"
## [10] "{Smoking=Yes,Diabetes=No,HeartDisease=No}"
## [11] "{Smoking=No,Obese=No,Diabetes=No,PoorHealth=No}"
## [12] "{Smoking=No,Inactive=No,Diabetes=No,HeartDisease=No,Asthma=No}"
## [13] "{Smoking=No,Obese=No,Diabetes=No,Asthma=No,PoorHealth=No}"
## [14] "{Inactive=No,Obese=No,Diabetes=No,HeartDisease=No}"
## [15] "{Smoking=No,Inactive=No,Obese=No,Diabetes=No}"
## [16] "{Smoking=No,Inactive=No,Obese=No,HeartDisease=No}"
## [17] "{Inactive=No,Obese=No,Diabetes=No,Asthma=No}"
## [18] "{Inactive=No,Obese=No,HeartDisease=No,Asthma=No}"
## [19] "{Smoking=No,Obese=No,HeartDisease=No,PoorHealth=No}"
## [20] "{Inactive=No,Diabetes=No,Asthma=No,PoorHealth=No}"
## Itemsets in Consequent (RHS)
## [1] "{Inactive=No}" "{PoorHealth=No}" "{Obese=No}"
The matrix representation provides a compact overview of antecedent–consequent relationships and emphasizes rule strength through color intensity.
To better understand how multiple lifestyle factors combine within individual rules, a parallel coordinates plot was used to visualize the structure of the 20 rules with the highest lift. Unlike tabular summaries, this visualization allows simultaneous inspection of how several attributes align within the same rule, highlighting recurring combinations rather than isolated associations.
plot(
  sort(rules, by = "lift", decreasing = TRUE)[1:20],
  method = "paracoord",
  control = list(reorder = TRUE)
)
The parallel coordinates plot highlights recurring item combinations across the strongest rules: the same indicators appear together in many antecedents and end in one of only three consequents (Inactive=No, PoorHealth=No, Obese=No). The left-to-right layout reflects an item's position within a rule, not temporal order. Rather than isolated associations, the visualization shows stable combinations of attributes that recur across multiple rules, consistent with the clustered structure captured by the association analysis.
Applying the Apriori algorithm to the preprocessed BRFSS data with a minimum support of 0.05 and a minimum confidence of 0.6 produced 459 frequent itemsets and 1,171 association rules, which were then ranked by lift. This is too many rules to read one by one, so interpretation relied on qualitative inspection of the highest-lift rules and of groups of related rules rather than on purely metric-driven evaluation. Itemsets with six or seven items were rare (20 of 459), but several of the top-ranked rules are long, with antecedents that combine five or six “=No” items.
The highest-lift rules captured expected public health relationships in mirror-image form: respondents without the recorded risk factors (non-smoking, physically active, not obese, without diabetes or heart disease) overwhelmingly do not report fair or poor health, with confidence of about 0.92 to 0.93 for PoorHealth=No and lift between 1.14 and 1.16. Lift values are above one but modest throughout; the largest value among all rules is about 1.17. While these associations are well established in the literature, their emergence here offers a basic plausibility check that the association rule framework is capturing real structure in the survey data rather than producing arbitrary combinations.
Rules involving smoking status also appear among the highest-lift rules, although not with an adverse outcome as the consequent: current smokers without diabetes are not obese with a confidence of about 0.68 to 0.70 (lift 1.14 to 1.17). Much of that confidence reflects the high marginal prevalence of the consequent, since roughly 60% of all respondents carry the item Obese=No (confidence divided by lift), which underscores the importance of interpreting confidence alongside lift. These rules describe co-occurrence only, and the analysis does not support any causal interpretation.
Beyond these intuitive patterns, a smaller subset of rules revealed more complex co-occurrence among multiple lifestyle factors and chronic conditions. These rules were less prevalent but still met the filtering criteria, suggesting potential behavioral or health clustering within certain subgroups of respondents. However, the absence of adjustment for confounding variables such as age, income, or access to healthcare limits the interpretability of these patterns, as shared background characteristics may drive the observed associations.
Redundancy among rules was evident, with several rules differing only marginally in their antecedents or consequents while conveying similar information. Rather than interpreting each rule independently, related rules were considered collectively to identify stable patterns of co-occurrence. This approach reduced the risk of over-interpreting minor variations and helped focus attention on broader and more consistent relationships within the data.
Overall, the results indicate that association rule mining is effective at summarizing common behavioral and health-related patterns in survey data, but offers limited capacity for uncovering novel insights in the absence of additional modeling or contextual information. The identified rules are best interpreted as descriptive summaries of co-occurrence rather than evidence of dependence, directionality, or causation.
The results of this analysis are consistent with the idea that chronic health risks tend to appear in combination rather than as isolated factors. Several of the strongest and most stable rules suggest that physical inactivity and poor sleep hygiene recur as common elements within broader clusters of risk. When these factors co-occur, the confidence associated with metabolic conditions such as hypertension and high cholesterol increases noticeably. Because the analysis does not support causal inference, these patterns cannot show whether risks accumulate additively or multiplicatively; they indicate only that certain combinations of behaviors co-occur with adverse outcomes more often than the individual factors would suggest in isolation.
Another recurring pattern in the discovered rules is the role of smoking as part of a broader behavioral cluster rather than a standalone risk factor. High-lift rules frequently positioned smoking alongside other lifestyle attributes, including lower consumption of fruits and vegetables and higher alcohol intake. This co-occurrence pattern is consistent with smoking being embedded within a wider set of risk-taking or health-compromising habits, although cross-sectional data cannot show whether it precedes or follows them. From an analytical standpoint, this highlights the value of association rule mining in revealing behavioral groupings that might be obscured in single-variable analyses. From a public health perspective, it also suggests that interventions targeting smoking in isolation may overlook the broader context in which the behavior occurs.
The emphasis on lift as a primary filtering criterion played a critical role in uncovering these more nuanced patterns. By deprioritizing rules driven primarily by high marginal prevalence, the analysis was able to surface associations that are less obvious but potentially more informative. Examples include high-lift rules linking mental health indicators, such as days of poor mental health, with physical inactivity, as well as rules connecting moderate alcohol consumption with certain positive health markers within specific demographic segments. These findings illustrate a central strength of unsupervised learning methods: their ability to reveal unexpected relationships that might not be explicitly tested in hypothesis-driven models.
Taken together, these patterns point toward potential strategic implications for public health analysis and intervention design. Viewing health behaviors as interconnected “bundles” rather than independent risk factors allows for a more relational understanding of disease pathways. If particular combinations of lifestyle attributes consistently exhibit high confidence for adverse outcomes, interventions could be designed to address these clusters collectively rather than relying on generic, one-size-fits-all recommendations. While this analysis remains exploratory and descriptive, it demonstrates how association rule mining can support a shift in perspective, from treating individual symptoms or behaviors to identifying and disrupting the behavioral chains that underlie chronic health risk.
This analysis is subject to several limitations that stem from both the nature of the data and the chosen methodology. The BRFSS dataset relies on self-reported information, which is inherently vulnerable to recall bias and social desirability effects. Additionally, the cross-sectional design of the survey prevents any assessment of temporal ordering, making it impossible to infer the direction of observed associations. The preprocessing steps required to enable association rule mining introduced further constraints. The conversion of categorical and ordinal variables into binary indicators resulted in a loss of information regarding intensity and frequency of behaviors. Treating each respondent as a transaction also assumes equal weight across attributes and ignores potential interactions between variables.
Methodological limitations are also present. Standard implementations of the Apriori algorithm do not accommodate survey weights or complex sampling designs, limiting the generalizability of the findings to the broader population. Furthermore, the lack of confounder adjustment means that many observed associations may reflect correlated background characteristics rather than meaningful behavioral relationships. These limitations are structural rather than procedural, and they highlight the importance of aligning analytical methods with data characteristics and research objectives.
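To make the weighting issue concrete, the sketch below contrasts an unweighted support with a weighted one on toy data. Everything in it is hypothetical: the indicator vector and the weights are invented, and in BRFSS the weights would come from the survey's final weight variable (assumed here to be `_LLCPWT`). It illustrates the arithmetic only and is not a design-based estimator.

```r
# Toy data (hypothetical): TRUE where a respondent holds every item in an itemset
has_itemset <- c(TRUE, TRUE, FALSE, FALSE)

# Hypothetical survey weights: each respondent stands for this many people
w <- c(1, 1, 3, 5)

# Unweighted support: share of sampled respondents with the itemset
unweighted_support <- mean(has_itemset)

# Weighted support: share of the weighted population with the itemset
weighted_support <- sum(w[has_itemset]) / sum(w)

print(c(unweighted = unweighted_support, weighted = weighted_support))
```

When respondents with a given combination carry systematically smaller weights, as in this toy case, the sample-based support overstates how common the combination is in the population. The same distortion carries through to confidence and lift.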
This project explored the application of association rule mining to public health survey data using the BRFSS as a case study. The analysis demonstrated that, when applied cautiously, association rule mining can effectively summarize common patterns of co-occurrence among health behaviors and conditions. The resulting rules largely reflected well-established public health relationships, reinforcing the descriptive validity of the approach. However, the findings also reveal the methodological boundaries of association rule mining in non-transactional settings. The absence of causal structure, confounder adjustment, and population weighting limits the analytical depth and practical applicability of the results. As such, association rules should be viewed as a preliminary exploratory tool rather than a substitute for more rigorous statistical or causal methods.
In this context, the primary value of association rule mining lies in its ability to provide a high-level overview of behavioral clustering and to support hypothesis generation. Future analyses could extend this work by integrating association rules with complementary modeling approaches that explicitly account for confounding and survey design.