Introduction

In this paper, I use a drug dataset that records, for each drug, the associated medical condition, its drug classes, its prescription type, and its reported side effects. Because these variables are categorical and often occur together, it is hard to see which combinations are systematically related by inspecting each variable on its own.

For this reason, I use association rule mining to explore how medical conditions, drug categories, and side effects co-occur in the data. This method identifies frequent patterns without testing predefined hypotheses. More specifically, I apply the Apriori algorithm, which finds itemsets above a minimum support and derives rules above a minimum confidence; I then rank the resulting rules by lift.
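For reference, the three measures can be written as follows. The notation is my own shorthand rather than something taken from the dataset or the package: $X$ and $Y$ are disjoint itemsets, $T_1, \dots, T_N$ are the transactions, and $\operatorname{supp}(Z)$ is the share of transactions containing every item in $Z$.

```latex
\begin{align*}
\operatorname{supp}(Z) &= \frac{\lvert \{\, i : Z \subseteq T_i \,\} \rvert}{N} \\
\operatorname{supp}(X \Rightarrow Y) &= \operatorname{supp}(X \cup Y) \\
\operatorname{conf}(X \Rightarrow Y) &= \frac{\operatorname{supp}(X \cup Y)}{\operatorname{supp}(X)} \\
\operatorname{lift}(X \Rightarrow Y) &= \frac{\operatorname{conf}(X \Rightarrow Y)}{\operatorname{supp}(Y)}
\end{align*}
```

A lift of 1 means $Y$ appears with $X$ exactly as often as its overall frequency predicts; values above 1 indicate that $Y$ is over-represented among transactions containing $X$, and values below 1 that it is under-represented.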

The goal of this analysis is to uncover meaningful association patterns between medical conditions, drug classes, and adverse effects.

Packages

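library(tidyverse)   # dplyr pipelines and data manipulation
library(arules)      # transactions class and apriori()
library(arulesViz)   # rule plots
library(stringr)     # string helpers (also attached by tidyverse)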

Data Import

data <- read.csv("drugs_side_effects_drugs_com.csv", stringsAsFactors = FALSE)

data %>% 
  select(drug_name, medical_condition, drug_classes, rx_otc) %>% 
  slice_head(n = 6)
##        drug_name medical_condition
## 1    doxycycline              Acne
## 2 spironolactone              Acne
## 3    minocycline              Acne
## 4       Accutane              Acne
## 5    clindamycin              Acne
## 6      Aldactone              Acne
##                                                        drug_classes rx_otc
## 1                        Miscellaneous antimalarials, Tetracyclines     Rx
## 2     Aldosterone receptor antagonists, Potassium-sparing diuretics     Rx
## 3                                                     Tetracyclines     Rx
## 4 Miscellaneous antineoplastics, Miscellaneous uncategorized agents     Rx
## 5                      Topical acne agents, Vaginal anti-infectives     Rx
## 6     Aldosterone receptor antagonists, Potassium-sparing diuretics     Rx

Transaction Data Preparation

In this part, each drug entry is converted into a set of categorical items: its medical condition, its drug classes, and its reported side effects.

data_sub <- data %>%
  select(medical_condition, drug_classes, side_effects)

Creating Items for Apriori Algorithm

Side effects are stored as free text, so I map them to a fixed vocabulary of side-effect tags by checking whether each tag occurs in the text. Drug classes and medical conditions are already categorical and are kept as items.

Creating Item Lists

side_tags <- c(
  "hives", "itching", "swelling", "rash", "fever",
  "nausea", "vomiting", "diarrhea", "dizziness", "headache",
  "short of breath", "chest pain", "stomach pain", "jaundice",
  "tiredness", "weakness"
)
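To make the tagging step concrete, here is a minimal sketch of literal keyword matching with stringr. The sentence in `example_text` and the shortened `tags` vector are invented for illustration and are not taken from the dataset.

```r
library(stringr)

# Hypothetical side-effect text, invented for this example
example_text <- "Hives, swelling of the face, nausea, and severe Headache."

# Shortened tag vocabulary (assumption: a subset of the full list)
tags <- c("hives", "rash", "swelling", "nausea", "headache", "short of breath")

# One literal substring test per tag against the lowercased text
hits <- tags[str_detect(tolower(example_text), fixed(tags))]

# Prefix the matched tags and replace spaces so each item is a single token
se_items <- paste0("SE_", str_replace_all(hits, "\\s+", "_"))
se_items
# "SE_hives" "SE_swelling" "SE_nausea" "SE_headache"
```

Because this is a plain substring test, a tag is assigned whenever its characters appear anywhere in the text, so a phrase such as "no fever" would still produce a fever tag. The tags therefore record that a side effect is mentioned, not how often or how severely it occurs.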

Converting to Transaction Format

In this section, I turn each drug into a transaction containing one condition (COND_), its drug classes (CLASS_), and the side-effect tags (SE_) detected in its side-effect text. The prefixes give every item a consistent format and make it possible to tell the three item types apart when running the Apriori algorithm.

df_items <- data_sub %>%
  # One condition item per row, e.g. "COND_acne"
  mutate(medical_item = paste0("COND_", str_replace_all(tolower(medical_condition), "\\s+", "_"))) %>%
  rowwise() %>%
  mutate(
    # Split the comma-separated class string into one CLASS_ item per class
    class_items = list({
      x <- str_split(tolower(coalesce(drug_classes, "")), ",\\s*")[[1]]
      x <- str_replace_all(x, "\\s+", "_")
      x <- x[x != ""]
      paste0("CLASS_", x)
    }),
    # Keep every tag from side_tags that occurs in the lowercased side-effect text
    side_items = list({
      se_txt <- tolower(coalesce(side_effects, ""))
      hits <- side_tags[str_detect(se_txt, fixed(side_tags, ignore_case = TRUE))]
      paste0("SE_", str_replace_all(hits, "\\s+", "_"))
    }),
    # Combine all items of this row into one transaction
    items = list(c(medical_item, class_items, side_items))
  ) %>%
  ungroup() %>%
  select(items)

trans <- as(df_items$items, "transactions")

# Rows without any drug class yield a bare "CLASS_" label; drop that item
idx <- which(itemLabels(trans) == "CLASS_")
if (length(idx) > 0) trans <- trans[, -idx]

summary(trans)
## transactions as itemMatrix in sparse format with
##  2931 rows (elements/itemsets/transactions) and
##  307 columns (items) and a density of 0.03118634 
## 
## most frequent items:
##  SE_swelling     SE_hives    SE_nausea SE_dizziness  SE_headache      (Other) 
##         2597         2239         1662         1516         1513        18535 
## 
## element (itemset/transaction) length distribution:
## sizes
##   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19 
##  18 135 103 168 185 320 279 244 297 226 281 201 193 101 137  37   2   4 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   7.000  10.000   9.574  12.000  19.000 
## 
## includes extended item information - examples:
##                                                      labels
## 2                                  CLASS_5-aminosalicylates
## 3 CLASS_ace_inhibitors_with_calcium_channel_blocking_agents
## 4                       CLASS_ace_inhibitors_with_thiazides

Exploratory Analysis

Before applying Apriori, I examine the most frequent items to understand the general structure of the transactions.

itemFrequencyPlot(trans, topN = 20, type = "absolute")

Apriori Algorithm

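In this part, I apply the Apriori algorithm to identify association rules between medical conditions, drug classes, and reported side effects. Each transaction contains one medical condition (COND_), the drug classes listed for the drug (CLASS_, possibly none), and the detected side-effect tags (SE_). The appearance argument restricts COND_ and CLASS_ items to the left-hand side and SE_ items to the right-hand side, so every rule predicts a side effect from a condition and/or class. With minlen = 3, each rule has at least two items on the left-hand side.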
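The effect of the left-hand/right-hand restriction is easiest to see on a toy example. The five transactions below are invented, and the thresholds are chosen only so that this tiny set produces rules; they are not the values used in the analysis.

```r
library(arules)

# Invented transactions using the same COND_/CLASS_/SE_ prefixes
toy <- list(
  c("COND_acne", "CLASS_topical", "SE_itching"),
  c("COND_acne", "CLASS_topical", "SE_itching"),
  c("COND_acne", "CLASS_oral",    "SE_nausea"),
  c("COND_flu",  "CLASS_combo",   "SE_dizziness"),
  c("COND_flu",  "CLASS_combo",   "SE_nausea")
)
toy_trans <- as(toy, "transactions")
labs <- itemLabels(toy_trans)

# Conditions and classes may appear only on the left, side effects only on the right
toy_rules <- apriori(
  toy_trans,
  parameter = list(supp = 0.1, conf = 0.5, minlen = 3),
  appearance = list(
    lhs = grep("^(COND_|CLASS_)", labs, value = TRUE),
    rhs = grep("^SE_", labs, value = TRUE),
    default = "none"
  ),
  control = list(verbose = FALSE)
)

inspect(sort(toy_rules, by = "lift"))
```

Every rule returned here has a condition and a class on the left and a single side-effect tag on the right. Without the appearance constraint, Apriori would also return rules such as a side effect predicting a condition, which are not the question this analysis asks.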

rules <- apriori(
  trans,
  parameter = list(supp = 0.02, conf = 0.4, minlen = 3, maxlen = 4),
  appearance = list(
    lhs = c(grep("^COND_", itemLabels(trans), value = TRUE),
            grep("^CLASS_", itemLabels(trans), value = TRUE)),
    rhs = grep("^SE_", itemLabels(trans), value = TRUE),
    default = "none"
  ),
  control = list(verbose = FALSE)
)

rules_lift <- sort(rules, by = "lift", decreasing = TRUE)
rules_conf <- sort(rules, by = "confidence", decreasing = TRUE)

summary(rules)
## set of 16 rules
## 
## rule length distribution (lhs + rhs):sizes
##  3 
## 16 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       3       3       3       3       3       3 
## 
## summary of quality measures:
##     support          confidence        coverage            lift       
##  Min.   :0.02013   Min.   :0.4363   Min.   :0.03071   Min.   :0.8734  
##  1st Qu.:0.02960   1st Qu.:0.6164   1st Qu.:0.03966   1st Qu.:1.0268  
##  Median :0.03872   Median :0.7289   Median :0.06960   Median :1.2037  
##  Mean   :0.03900   Mean   :0.7348   Mean   :0.05482   Mean   :1.2109  
##  3rd Qu.:0.04333   3rd Qu.:0.9020   3rd Qu.:0.06960   3rd Qu.:1.3140  
##  Max.   :0.06892   Max.   :0.9902   Max.   :0.06960   Max.   :1.8804  
##      count       
##  Min.   : 59.00  
##  1st Qu.: 86.75  
##  Median :113.50  
##  Mean   :114.31  
##  3rd Qu.:127.00  
##  Max.   :202.00  
## 
## mining info:
##   data ntransactions support confidence
##  trans          2931    0.02        0.4
##                                                                                                                                                                                                                                                                                                                             call
##  apriori(data = trans, parameter = list(supp = 0.02, conf = 0.4, minlen = 3, maxlen = 4), appearance = list(lhs = c(grep("^COND_", itemLabels(trans), value = TRUE), grep("^CLASS_", itemLabels(trans), value = TRUE)), rhs = grep("^SE_", itemLabels(trans), value = TRUE), default = "none"), control = list(verbose = FALSE))
inspect(head(rules_lift, 5))
##     lhs                                        rhs               support confidence   coverage     lift count
## [1] {CLASS_topical_acne_agents,                                                                              
##      COND_acne}                             => {SE_itching}   0.03923576  0.9200000 0.04264756 1.880418   115
## [2] {CLASS_topical_steroids,                                                                                 
##      COND_eczema}                           => {SE_itching}   0.02388263  0.7777778 0.03070624 1.589726    70
## [3] {CLASS_upper_respiratory_combinations,                                                                   
##      COND_colds_&_flu}                      => {SE_dizziness} 0.04810645  0.6911765 0.06960082 1.336305   141
## [4] {CLASS_upper_respiratory_combinations,                                                                   
##      COND_colds_&_flu}                      => {SE_rash}      0.04298874  0.6176471 0.06960082 1.318517   126
## [5] {CLASS_topical_steroids,                                                                                 
##      COND_eczema}                           => {SE_weakness}  0.02012965  0.6555556 0.03070624 1.312454    59
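As a sanity check, the quality measures of rule [1] can be recomputed from the numbers printed above. This is only a small arithmetic sketch: the inputs are copied from the output, and the variable names are my own.

```r
n     <- 2931      # number of transactions, from summary(trans)
count <- 115       # transactions matching rule [1]
conf  <- 0.92      # reported confidence of rule [1]
lift  <- 1.880418  # reported lift of rule [1]

support     <- count / n      # share of transactions containing lhs and rhs
coverage    <- support / conf # share of transactions containing the lhs
rhs_support <- conf / lift    # implied overall share of SE_itching

round(c(support = support, coverage = coverage, rhs_support = rhs_support), 4)
#     support    coverage rhs_support
#      0.0392      0.0426      0.4893

# Smallest count that satisfies the support threshold of 0.02
min_count <- ceiling(0.02 * n)
min_count
# 59
```

Read this way, itching is tagged in roughly 49% of all transactions but in 92% of those for topical acne agents prescribed for acne, which is what a lift of 1.88 expresses. The minimum count of 59 matches the smallest count reported in `summary(rules)`.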

Rule Visualization

plot(head(rules_lift, 15), method = "graph", engine = "htmlwidget")

Rule Distribution Analysis

This plot shows how the discovered rules are distributed in terms of support and confidence. Each point is one rule, and its shading encodes lift: darker points indicate higher lift values and therefore stronger associations.

plot(rules, 
     method = "scatterplot", 
     measure = c("support", "confidence"), 
     shading = "lift")

Results and Interpretation

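When I ranked the rules by lift, the strongest was {topical acne agents, acne} => itching (confidence 0.92, lift 1.88), followed by {topical steroids, eczema} => itching (confidence 0.78, lift 1.59). Upper respiratory combinations used for colds and flu were associated with dizziness (lift 1.34) and rash (lift 1.32). All of these lifts are above 1, meaning the side effect is tagged more often for these condition and class pairs than in the dataset as a whole, which suggests non-random relationships between certain drug classes and specific adverse effects. The effects are moderate in size, and not every rule is informative: the lowest lift among the 16 rules is 0.87, which is below 1.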

Conclusion

In this paper, I applied association rule mining to explore relationships between medical conditions, drug classes, and reported side effects. With a minimum support of 0.02 and a minimum confidence of 0.4, the Apriori algorithm returned 16 rules, each with two items on the left-hand side and one side-effect tag on the right-hand side.

The results suggest that certain treatments are systematically linked to specific adverse effects, most clearly topical acne agents and topical steroids with itching. Overall, the analysis shows that association rule mining can be a useful tool for discovering structured patterns in healthcare-related datasets.