1 Introduction

The objective of this project is to apply Association Rule Mining using the Apriori algorithm to the Drug Consumption Dataset from the UCI Machine Learning Repository. The goal is to uncover non-obvious and interpretable relationships between personality traits (Big Five, impulsivity, sensation seeking) and the use of psychoactive substances.

2 Dataset Description

The dataset contains 1,885 observations and 31 variables, including personality traits and self-reported drug consumption. Personality characteristics are represented using the Big Five model, with Nscore, Escore, Oscore, Ascore, and Cscore corresponding to Neuroticism, Extraversion, Openness to Experience, Agreeableness, and Conscientiousness, respectively. These traits capture stable individual differences in emotional, cognitive, and behavioral functioning.

All personality variables are provided as standardized Z-scores and were subsequently discretized into low, medium, and high categories to facilitate association rule mining. In addition to the Big Five traits, the dataset includes measures of impulsivity (Impulsive) and sensation seeking (SS), which reflect tendencies toward impulsive behavior and the pursuit of novel and intense experiences. Drug consumption variables are measured on a seven-level categorical scale (CL0–CL6), indicating increasing recency of use.

3 Data Loading

url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00373/drug_consumption.data"

col_names <- c(
  "ID","Age","Gender","Education","Country","Ethnicity",
  "Nscore","Escore","Oscore","Ascore","Cscore","Impulsive","SS",
  "Alcohol","Amphet","Amyl","Benzos","Caff","Cannabis","Choc","Coke",
  "Crack","Ecstasy","Heroin","Ketamine","Legalh","LSD","Meth",
  "Mushrooms","Nicotine","Semer","VSA"
)

df <- read.csv(url, header = FALSE, col.names = col_names)
df$ID <- NULL

dim(df)
## [1] 1885   31

4 Data Preprocessing

4.1 Discretization of Personality Traits

Continuous personality scores were discretized into tertiles (low, medium, high). This transformation allows Apriori to focus on distinct personality profiles and facilitates interpretation of resulting rules.

personality_traits <- c(
  "Nscore","Escore","Oscore",
  "Ascore","Cscore","Impulsive","SS"
)

for (col in personality_traits) {
  df[[col]] <- cut(
    df[[col]],
    breaks = quantile(df[[col]], probs = c(0, 1/3, 2/3, 1)),
    labels = c(
      paste0("Low_", col),
      paste0("Med_", col),
      paste0("High_", col)
    ),
    include.lowest = TRUE
  )
}

4.2 Binarization of Drug Use

Drug consumption variables were simplified into binary categories: non-user (CL0–CL2) and user (CL3–CL6), focusing the analysis on recent or active use.

target_substances <- c(
  "Cannabis","Ecstasy","LSD",
  "Mushrooms","Coke","Nicotine"
)

for (drug in target_substances) {
  df[[drug]] <- ifelse(
    df[[drug]] %in% c("CL3","CL4","CL5","CL6"),
    paste0("User_", drug),
    paste0("NonUser_", drug)
  )
}

5 Transaction Creation

final_df <- df[, c(personality_traits, target_substances)]
transactions <- as(final_df, "transactions")

summary(transactions)
## transactions as itemMatrix in sparse format with
##  1885 rows (elements/itemsets/transactions) and
##  33 columns (items) and a density of 0.3939394 
## 
## most frequent items:
##             LSD=NonUser_LSD           Coke=NonUser_Coke 
##                        1505                        1468 
## Mushrooms=NonUser_Mushrooms     Ecstasy=NonUser_Ecstasy 
##                        1451                        1368 
##      Nicotine=User_Nicotine                     (Other) 
##                        1060                       17653 
## 
## element (itemset/transaction) length distribution:
## sizes
##   13 
## 1885 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      13      13      13      13      13      13 
## 
## includes extended item information - examples:
##               labels variables      levels
## 1  Nscore=Low_Nscore    Nscore  Low_Nscore
## 2  Nscore=Med_Nscore    Nscore  Med_Nscore
## 3 Nscore=High_Nscore    Nscore High_Nscore
## 
## includes extended transaction information - examples:
##   transactionID
## 1             1
## 2             2
## 3             3

6 Association Rule Mining

The Apriori algorithm was applied using minimum support of 5% and minimum confidence of 35%, balancing rule strength and interpretability.

rules <- apriori(
  transactions,
  parameter = list(
    supp = 0.05,
    conf = 0.35,
    minlen = 2
  )
)
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.35    0.1    1 none FALSE            TRUE       5    0.05      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 94 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[33 item(s), 1885 transaction(s)] done [0.00s].
## sorting and recoding items ... [33 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 done [0.06s].
## writing ... [48537 rule(s)] done [0.01s].
## creating S4 object  ... done [0.01s].
rules <- sort(rules, by = "lift")
summary(rules)
## set of 48537 rules
## 
## rule length distribution (lhs + rhs):sizes
##     2     3     4     5     6     7     8     9 
##   495  4685 11607 13815 10757  5379  1585   214 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   4.000   5.000   5.096   6.000   9.000 
## 
## summary of quality measures:
##     support          confidence        coverage           lift       
##  Min.   :0.05040   Min.   :0.3500   Min.   :0.0504   Min.   :0.4939  
##  1st Qu.:0.05782   1st Qu.:0.5291   1st Qu.:0.0748   1st Qu.:1.1780  
##  Median :0.06950   Median :0.7651   Median :0.1040   Median :1.2841  
##  Mean   :0.08434   Mean   :0.7338   Mean   :0.1275   Mean   :1.3731  
##  3rd Qu.:0.09390   3rd Qu.:0.9608   3rd Qu.:0.1523   3rd Qu.:1.5469  
##  Max.   :0.72042   Max.   :1.0000   Max.   :0.7984   Max.   :3.9824  
##      count     
##  Min.   :  95  
##  1st Qu.: 109  
##  Median : 131  
##  Mean   : 159  
##  3rd Qu.: 177  
##  Max.   :1358  
## 
## mining info:
##          data ntransactions support confidence
##  transactions          1885    0.05       0.35
##                                                                                  call
##  apriori(data = transactions, parameter = list(supp = 0.05, conf = 0.35, minlen = 2))

6.1 Support vs Confidence

The scatter plot illustrates the relationship between support and confidence for all association rules, with color intensity indicating the lift measure.

plot(rules, measure = c("support", "confidence"), shading = "lift")

Most rules show low support but moderate to high confidence, suggesting reliable patterns within specific subgroups rather than the entire population. Rules with higher lift values are concentrated in this region, indicating that the most informative associations are not necessarily the most frequent ones.

6.2 Distribution of Lift Values

hist(quality(rules)$lift,
     breaks = 30,
     main = "Distribution of Lift Values",
     xlab = "Lift",
     col = "lightgray",
     border = "white")

Most rules exhibit lift values slightly above one, indicating weak to moderate positive associations between antecedents and consequents. A smaller subset of rules reaches substantially higher lift values, representingstrong and potentially meaningful deviations from statistical independence.

7 Rule Filtering and Redundancy Removal

To improve interpretability, rules were filtered to include only those where personality traits predict the use of a single substance. Redundant rules were removed.

rules_clean <- subset(
  rules,
  subset =
    rhs %pin% "User_" &
    size(rhs) == 1 &
    !(lhs %pin% "User_")
)

rules_clean <- rules_clean[!is.redundant(rules_clean)]
rules_top <- head(sort(rules_clean, by = "lift"), 10)

inspect(rules_top)
##      lhs                      rhs                           support confidence  coverage     lift count
## [1]  {Oscore=High_Oscore,                                                                              
##       SS=High_SS}          => {LSD=User_LSD}             0.06578249  0.4960000 0.1326260 2.460421   124
## [2]  {Oscore=High_Oscore,                                                                              
##       SS=High_SS}          => {Mushrooms=User_Mushrooms} 0.06896552  0.5200000 0.1326260 2.258525   130
## [3]  {Oscore=High_Oscore,                                                                              
##       Cscore=Low_Cscore}   => {LSD=User_LSD}             0.05411141  0.4322034 0.1251989 2.143956   102
## [4]  {Cscore=Low_Cscore,                                                                               
##       SS=High_SS}          => {LSD=User_LSD}             0.06100796  0.4307116 0.1416446 2.136556   115
## [5]  {Escore=High_Escore,                                                                              
##       Oscore=High_Oscore}  => {LSD=User_LSD}             0.05145889  0.4181034 0.1230769 2.074013    97
## [6]  {Escore=High_Escore,                                                                              
##       Oscore=High_Oscore}  => {Mushrooms=User_Mushrooms} 0.05782493  0.4698276 0.1230769 2.040611   109
## [7]  {Oscore=High_Oscore,                                                                              
##       Cscore=Low_Cscore}   => {Mushrooms=User_Mushrooms} 0.05835544  0.4661017 0.1251989 2.024428   110
## [8]  {Escore=High_Escore,                                                                              
##       SS=High_SS}          => {Ecstasy=User_Ecstasy}     0.05941645  0.5463415 0.1087533 1.991980   112
## [9]  {Cscore=Low_Cscore,                                                                               
##       SS=High_SS}          => {Coke=User_Coke}           0.06206897  0.4382022 0.1416446 1.980842   117
## [10] {Cscore=Low_Cscore,                                                                               
##       SS=High_SS}          => {Ecstasy=User_Ecstasy}     0.07692308  0.5430712 0.1416446 1.980056   145

8 Visualization of Association Rules

8.1 Network Graph

plot(
  rules_top,
  method = "graph",
  measure = "support",
  shading = "lift",
  main = "Association Rules: Personality Traits → Drug Use"
)
## Available control parameters (with default values):
## layout    =  stress
## circular  =  FALSE
## ggraphdots    =  NULL
## edges     =  <environment>
## nodes     =  <environment>
## nodetext  =  <environment>
## colors    =  c("#EE0000FF", "#EEEEEEFF")
## engine    =  ggplot2
## max   =  100
## verbose   =  FALSE

The network graph visualizes the strongest association rules by representing personality traits and substances as nodes, with edge direction indicating rule flow from antecedents to consequents. Node size reflects support, while color intensity corresponds to lift, highlighting particularly strong associations involving high sensation seeking and high openness to experience.

8.2 Grouped Matrix Visualization

The grouped matrix visualization summarizes the strongest association rules by organizing them according to shared antecedent itemsets on the left-hand side and their corresponding consequents on the right-hand side. Dot size represents support, while color intensity indicates lift.

plot(
  rules_top,
  method = "grouped",
  control = list(type = "items")
)
## Available control parameters (with default values):
## k     =  20
## aggr.fun  =  function (x, ...)  UseMethod("mean")
## rhs_max   =  10
## lhs_label_items   =  2
## col   =  c("#EE0000FF", "#EEEEEEFF")
## groups    =  NULL
## engine    =  ggplot2
## verbose   =  FALSE

8.3 Parallel Coordinates Plot

plot(
  rules_top,
  method = "paracoord",
  control = list(reorder = TRUE)
)

9 Discussion

The strongest rules consistently involve high sensation seeking and high openness to experience predicting the use of psychedelic and stimulant substances such as LSD, mushrooms, ecstasy, and cocaine. This pattern aligns with established psychological theories linking novelty-seeking traits to experimentation with psychoactive substances.

10 Conclusion

This project demonstrates that Association Rule Mining can successfully uncover meaningful, non-linear relationships between personality traits and drug consumption. The applied preprocessing and filtering steps ensured both methodological rigor and interpretability of results.