In this paper, I use a drug dataset that includes information about medical conditions, drug classes, prescription types, and reported side effects. Because most of these variables are categorical and co-occur within a single drug entry, inspecting them one at a time does not reveal which combinations are systematically related.
For this reason, I use association rule mining to explore how medical conditions, drug categories, and side effects co-occur in the data. This method allows me to identify frequent patterns without testing predefined hypotheses. More specifically, I apply the Apriori algorithm to discover rules based on support, confidence, and lift measures.
The goal of this analysis is to uncover meaningful association patterns between medical conditions, drug classes, and adverse effects.
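For reference, the three measures named above can be written in their standard form. The notation is introduced here for illustration: $A$ is the left-hand side of a rule, $B$ the right-hand side, $N$ the number of transactions, and $\mathrm{count}(X)$ the number of transactions containing every item in $X$.

```latex
\begin{align*}
\mathrm{supp}(A \Rightarrow B) &= \frac{\mathrm{count}(A \cup B)}{N} \\
\mathrm{conf}(A \Rightarrow B) &= \frac{\mathrm{supp}(A \cup B)}{\mathrm{supp}(A)} \\
\mathrm{lift}(A \Rightarrow B) &= \frac{\mathrm{conf}(A \Rightarrow B)}{\mathrm{supp}(B)}
\end{align*}
```

A lift above 1 means that $A$ and $B$ occur together more often than they would if they were independent; a lift below 1 means they occur together less often.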
library(tidyverse)
library(arules)
library(arulesViz)
library(stringr)
data <- read.csv("drugs_side_effects_drugs_com.csv", stringsAsFactors = FALSE)
data %>%
  select(drug_name, medical_condition, drug_classes, rx_otc) %>%
  slice_head(n = 6)
## drug_name medical_condition
## 1 doxycycline Acne
## 2 spironolactone Acne
## 3 minocycline Acne
## 4 Accutane Acne
## 5 clindamycin Acne
## 6 Aldactone Acne
## drug_classes rx_otc
## 1 Miscellaneous antimalarials, Tetracyclines Rx
## 2 Aldosterone receptor antagonists, Potassium-sparing diuretics Rx
## 3 Tetracyclines Rx
## 4 Miscellaneous antineoplastics, Miscellaneous uncategorized agents Rx
## 5 Topical acne agents, Vaginal anti-infectives Rx
## 6 Aldosterone receptor antagonists, Potassium-sparing diuretics Rx
In this part, I convert each drug entry into a set of categorical items: its medical condition, its drug classes, and its reported side effects.
data_sub <- data %>%
  select(medical_condition, drug_classes, side_effects)
Side effects are free text, so I convert them into a set of standard side effect tags. Drug classes and medical conditions are kept as categorical items.
side_tags <- c(
  "hives", "itching", "swelling", "rash", "fever",
  "nausea", "vomiting", "diarrhea", "dizziness", "headache",
  "short of breath", "chest pain", "stomach pain", "jaundice",
  "tiredness", "weakness"
)
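The following is a minimal sketch of how this kind of tagging behaves, written in base R so that it runs without extra packages. The example sentence and the shortened tag list are made up for illustration. Because the match is a plain substring search, a tag can also fire inside a longer word ("itching" inside "twitching"). The whole-word variant is only one possible way to tighten this and is not what the analysis uses.

```r
# Hypothetical side-effect text, not taken from the dataset
se_txt <- tolower("Hives; severe nausea; muscle twitching")

# Shortened tag list for the example
tags <- c("hives", "itching", "nausea", "rash")

# Plain substring matching, as in the tagging step of this analysis
substring_hit <- vapply(
  tags,
  function(tag) grepl(tag, se_txt, fixed = TRUE),
  logical(1)
)

# Alternative (an assumption, not the paper's method): whole-word matching
word_hit <- vapply(
  tags,
  function(tag) grepl(paste0("\\b", tag, "\\b"), se_txt, perl = TRUE),
  logical(1)
)

substring_tags <- names(substring_hit)[substring_hit]
word_tags      <- names(word_hit)[word_hit]

print(substring_tags)  # includes "itching" because of "twitching"
print(word_tags)       # "itching" is no longer matched
```

Whether this matters in practice depends on how often such words appear in the side-effect text, which the output in this paper does not show.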
In this section, I turn each drug into a transaction containing one condition (COND_), its drug classes (CLASS_), and a small set of side-effect tags (SE_) extracted from the text. This creates a consistent item format for the Apriori algorithm.
df_items <- data_sub %>%
  # One COND_ item per drug, e.g. "COND_acne"
  mutate(medical_item = paste0("COND_", str_replace_all(tolower(medical_condition), "\\s+", "_"))) %>%
  rowwise() %>%
  mutate(
    # Split the comma-separated drug classes into one CLASS_ item each
    class_items = list({
      x <- str_split(tolower(coalesce(drug_classes, "")), ",\\s*")[[1]]
      x <- str_replace_all(x, "\\s+", "_")
      x <- x[x != ""]
      paste0("CLASS_", x)
    }),
    # Keep every tag from side_tags that occurs in the free-text side effects
    side_items = list({
      se_txt <- tolower(coalesce(side_effects, ""))
      hits <- side_tags[str_detect(se_txt, fixed(side_tags, ignore_case = TRUE))]
      paste0("SE_", str_replace_all(hits, "\\s+", "_"))
    }),
    # Combine all items of the drug into one transaction
    items = list(c(medical_item, class_items, side_items))
  ) %>%
  ungroup() %>%
  select(items)
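# Convert the list of item vectors into an arules transactions object
trans <- as(df_items$items, "transactions")
# Drop the bare "CLASS_" item produced by rows with no drug class
idx <- which(itemLabels(trans) == "CLASS_")
if (length(idx) > 0) trans <- trans[, -idx]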
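The cleanup step above is needed because of how paste0() handles empty input: a zero-length argument is treated as an empty string, so a drug with no classes still yields the bare item "CLASS_". The sketch below shows this behaviour. The helper prefix_items() is hypothetical and not part of the original code; it is one possible way to avoid creating the empty item in the first place.

```r
# A drug whose drug_classes field is empty ends up with no class names
no_classes <- character(0)

# paste0() treats the zero-length argument as "", so the bare prefix comes back
bare <- paste0("CLASS_", no_classes)
print(bare)

# Hypothetical helper (not in the original code): return no items for empty input
prefix_items <- function(prefix, x) {
  if (length(x) == 0) character(0) else paste0(prefix, x)
}

empty_result <- prefix_items("CLASS_", no_classes)
se_result    <- prefix_items("SE_", c("nausea", "rash"))
print(empty_result)
print(se_result)
```

The same behaviour applies to the SE_ prefix when no tag matches, so a bare "SE_" item may also be worth checking for; whether one occurs in this dataset is not visible in the output shown here.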
summary(trans)
## transactions as itemMatrix in sparse format with
## 2931 rows (elements/itemsets/transactions) and
## 307 columns (items) and a density of 0.03118634
##
## most frequent items:
## SE_swelling SE_hives SE_nausea SE_dizziness SE_headache (Other)
## 2597 2239 1662 1516 1513 18535
##
## element (itemset/transaction) length distribution:
## sizes
## 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
## 18 135 103 168 185 320 279 244 297 226 281 201 193 101 137 37 2 4
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 7.000 10.000 9.574 12.000 19.000
##
## includes extended item information - examples:
## labels
## 2 CLASS_5-aminosalicylates
## 3 CLASS_ace_inhibitors_with_calcium_channel_blocking_agents
## 4 CLASS_ace_inhibitors_with_thiazides
Before applying Apriori, I examine the most frequent items to understand the general structure of the transactions.
itemFrequencyPlot(trans, topN = 20, type = "absolute")
In this part, I apply the Apriori algorithm to identify association rules between medical conditions, drug classes, and reported side effects. Each transaction includes one medical condition (COND_), its drug classes (CLASS_), and the detected side effect tags (SE_). I restrict the rules so that conditions and drug classes can appear only on the left-hand side and side effect tags only on the right-hand side, and I require a minimum support of 0.02, a minimum confidence of 0.4, and a rule length of 3 to 4 items.
rules <- apriori(
  trans,
  parameter = list(supp = 0.02, conf = 0.4, minlen = 3, maxlen = 4),
  appearance = list(
    lhs = c(grep("^COND_", itemLabels(trans), value = TRUE),
            grep("^CLASS_", itemLabels(trans), value = TRUE)),
    rhs = grep("^SE_", itemLabels(trans), value = TRUE),
    default = "none"
  ),
  control = list(verbose = FALSE)
)
rules_lift <- sort(rules, by = "lift", decreasing = TRUE)
rules_conf <- sort(rules, by = "confidence", decreasing = TRUE)
summary(rules)
## set of 16 rules
##
## rule length distribution (lhs + rhs):sizes
## 3
## 16
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3 3 3 3 3 3
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.02013 Min. :0.4363 Min. :0.03071 Min. :0.8734
## 1st Qu.:0.02960 1st Qu.:0.6164 1st Qu.:0.03966 1st Qu.:1.0268
## Median :0.03872 Median :0.7289 Median :0.06960 Median :1.2037
## Mean :0.03900 Mean :0.7348 Mean :0.05482 Mean :1.2109
## 3rd Qu.:0.04333 3rd Qu.:0.9020 3rd Qu.:0.06960 3rd Qu.:1.3140
## Max. :0.06892 Max. :0.9902 Max. :0.06960 Max. :1.8804
## count
## Min. : 59.00
## 1st Qu.: 86.75
## Median :113.50
## Mean :114.31
## 3rd Qu.:127.00
## Max. :202.00
##
## mining info:
## data ntransactions support confidence
## trans 2931 0.02 0.4
## call
## apriori(data = trans, parameter = list(supp = 0.02, conf = 0.4, minlen = 3, maxlen = 4), appearance = list(lhs = c(grep("^COND_", itemLabels(trans), value = TRUE), grep("^CLASS_", itemLabels(trans), value = TRUE)), rhs = grep("^SE_", itemLabels(trans), value = TRUE), default = "none"), control = list(verbose = FALSE))
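As a quick check on the thresholds, the minimum support can be translated into a number of transactions. The sketch below uses only the values reported in the summaries above; the variable names are chosen here for illustration.

```r
n_trans  <- 2931   # number of transactions, from summary(trans)
min_supp <- 0.02   # supp argument passed to apriori()

# Smallest whole number of transactions that reaches the support threshold
min_count <- ceiling(min_supp * n_trans)
print(min_count)

# Support of a rule that occurs in exactly that many transactions
supp_at_min <- min_count / n_trans
print(round(supp_at_min, 5))
```

This agrees with the summary, where the rarest rule has a count of 59 and a support of 0.02013, so at least one rule sits right at the threshold.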
inspect(head(rules_lift, 5))
## lhs rhs support confidence coverage lift count
## [1] {CLASS_topical_acne_agents,
## COND_acne} => {SE_itching} 0.03923576 0.9200000 0.04264756 1.880418 115
## [2] {CLASS_topical_steroids,
## COND_eczema} => {SE_itching} 0.02388263 0.7777778 0.03070624 1.589726 70
## [3] {CLASS_upper_respiratory_combinations,
## COND_colds_&_flu} => {SE_dizziness} 0.04810645 0.6911765 0.06960082 1.336305 141
## [4] {CLASS_upper_respiratory_combinations,
## COND_colds_&_flu} => {SE_rash} 0.04298874 0.6176471 0.06960082 1.318517 126
## [5] {CLASS_topical_steroids,
## COND_eczema} => {SE_weakness} 0.02012965 0.6555556 0.03070624 1.312454 59
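To make the quality measures concrete, rule [3] can be recomputed by hand from numbers already reported in this paper: its count and coverage from the table above, and the frequency of SE_dizziness from summary(trans). This is only a sketch of the arithmetic, with variable names chosen for illustration.

```r
n_trans   <- 2931        # transactions, from summary(trans)
count_ab  <- 141         # transactions matching LHS and RHS of rule [3]
coverage  <- 0.06960082  # support of the LHS, as reported for rule [3]
count_rhs <- 1516        # frequency of SE_dizziness, from summary(trans)

# Support: share of all transactions that contain the whole rule
support <- count_ab / n_trans

# Number of transactions containing the LHS, recovered from the coverage
lhs_count <- round(coverage * n_trans)

# Confidence: share of LHS transactions that also contain the RHS
confidence <- count_ab / lhs_count

# Lift: confidence relative to the overall frequency of the RHS
lift <- confidence / (count_rhs / n_trans)

print(c(support = support, confidence = confidence, lift = lift))
```

The recomputed values match the reported support, confidence, and lift of rule [3] up to rounding. Dizziness already appears in more than half of all transactions, which is why a confidence of about 0.69 corresponds to a lift of only about 1.34.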
plot(head(rules_lift, 15), method = "graph", engine = "htmlwidget")
The scatterplot below shows how the discovered rules are distributed in terms of support and confidence. Darker points indicate higher lift, that is, stronger associations.
plot(rules,
     method = "scatterplot",
     measure = c("support", "confidence"),
     shading = "lift")
When I analyzed the rules with the highest lift, I observed that topical acne agents prescribed for acne (lift 1.88) and topical steroids prescribed for eczema (lift 1.59) were the most strongly associated with itching. In addition, upper respiratory combinations used for colds and flu showed higher than expected associations with dizziness (lift 1.34) and rash (lift 1.32). These patterns suggest non-random relationships between certain drug classes and specific adverse effects, although most lift values are only moderately above 1.
In this paper, I applied association rule mining to explore relationships between medical conditions, drug classes, and reported side effects. The Apriori algorithm helped identify meaningful co-occurrence patterns in the data.
The results suggest that certain treatments are systematically linked to specific adverse effects. In conclusion, this analysis shows that association rule mining can be a useful tool for discovering structured patterns in healthcare-related datasets.