The objective of this project is to apply Association Rule Mining using the Apriori algorithm to the Drug Consumption Dataset from the UCI Machine Learning Repository. The goal is to uncover non-obvious and interpretable relationships between personality traits (Big Five, impulsivity, sensation seeking) and the use of psychoactive substances.
The dataset contains 1,885 observations and 31 variables, including personality traits and self-reported drug consumption. Personality characteristics are represented using the Big Five model, with Nscore, Escore, Oscore, Ascore, and Cscore corresponding to Neuroticism, Extraversion, Openness to Experience, Agreeableness, and Conscientiousness, respectively. These traits capture stable individual differences in emotional, cognitive, and behavioral functioning.
All personality variables are provided as standardized Z-scores and were subsequently discretized into low, medium, and high categories to facilitate association rule mining. In addition to the Big Five traits, the dataset includes measures of impulsivity (Impulsive) and sensation seeking (SS), which reflect tendencies toward impulsive behavior and the pursuit of novel and intense experiences. Drug consumption variables are measured on a seven-level categorical scale (CL0–CL6), indicating increasing recency of use.
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/00373/drug_consumption.data"
col_names <- c(
"ID","Age","Gender","Education","Country","Ethnicity",
"Nscore","Escore","Oscore","Ascore","Cscore","Impulsive","SS",
"Alcohol","Amphet","Amyl","Benzos","Caff","Cannabis","Choc","Coke",
"Crack","Ecstasy","Heroin","Ketamine","Legalh","LSD","Meth",
"Mushrooms","Nicotine","Semer","VSA"
)
df <- read.csv(url, header = FALSE, col.names = col_names)
df$ID <- NULL
dim(df)
## [1] 1885 31
Continuous personality scores were discretized into tertiles (low, medium, high). This transformation allows Apriori to focus on distinct personality profiles and facilitates interpretation of resulting rules.
personality_traits <- c(
"Nscore","Escore","Oscore",
"Ascore","Cscore","Impulsive","SS"
)
for (col in personality_traits) {
df[[col]] <- cut(
df[[col]],
breaks = quantile(df[[col]], probs = c(0, 1/3, 2/3, 1)),
labels = c(
paste0("Low_", col),
paste0("Med_", col),
paste0("High_", col)
),
include.lowest = TRUE
)
}
Drug consumption variables were simplified into binary categories: non-user (CL0–CL2) and user (CL3–CL6), focusing the analysis on recent or active use.
target_substances <- c(
"Cannabis","Ecstasy","LSD",
"Mushrooms","Coke","Nicotine"
)
for (drug in target_substances) {
df[[drug]] <- ifelse(
df[[drug]] %in% c("CL3","CL4","CL5","CL6"),
paste0("User_", drug),
paste0("NonUser_", drug)
)
}
final_df <- df[, c(personality_traits, target_substances)]
transactions <- as(final_df, "transactions")
summary(transactions)
## transactions as itemMatrix in sparse format with
## 1885 rows (elements/itemsets/transactions) and
## 33 columns (items) and a density of 0.3939394
##
## most frequent items:
## LSD=NonUser_LSD Coke=NonUser_Coke
## 1505 1468
## Mushrooms=NonUser_Mushrooms Ecstasy=NonUser_Ecstasy
## 1451 1368
## Nicotine=User_Nicotine (Other)
## 1060 17653
##
## element (itemset/transaction) length distribution:
## sizes
## 13
## 1885
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13 13 13 13 13 13
##
## includes extended item information - examples:
## labels variables levels
## 1 Nscore=Low_Nscore Nscore Low_Nscore
## 2 Nscore=Med_Nscore Nscore Med_Nscore
## 3 Nscore=High_Nscore Nscore High_Nscore
##
## includes extended transaction information - examples:
## transactionID
## 1 1
## 2 2
## 3 3
The Apriori algorithm was applied using minimum support of 5% and minimum confidence of 35%, balancing rule strength and interpretability.
rules <- apriori(
transactions,
parameter = list(
supp = 0.05,
conf = 0.35,
minlen = 2
)
)
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.35 0.1 1 none FALSE TRUE 5 0.05 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 94
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[33 item(s), 1885 transaction(s)] done [0.00s].
## sorting and recoding items ... [33 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 done [0.06s].
## writing ... [48537 rule(s)] done [0.01s].
## creating S4 object ... done [0.01s].
rules <- sort(rules, by = "lift")
summary(rules)
## set of 48537 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5 6 7 8 9
## 495 4685 11607 13815 10757 5379 1585 214
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 4.000 5.000 5.096 6.000 9.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.05040 Min. :0.3500 Min. :0.0504 Min. :0.4939
## 1st Qu.:0.05782 1st Qu.:0.5291 1st Qu.:0.0748 1st Qu.:1.1780
## Median :0.06950 Median :0.7651 Median :0.1040 Median :1.2841
## Mean :0.08434 Mean :0.7338 Mean :0.1275 Mean :1.3731
## 3rd Qu.:0.09390 3rd Qu.:0.9608 3rd Qu.:0.1523 3rd Qu.:1.5469
## Max. :0.72042 Max. :1.0000 Max. :0.7984 Max. :3.9824
## count
## Min. : 95
## 1st Qu.: 109
## Median : 131
## Mean : 159
## 3rd Qu.: 177
## Max. :1358
##
## mining info:
## data ntransactions support confidence
## transactions 1885 0.05 0.35
## call
## apriori(data = transactions, parameter = list(supp = 0.05, conf = 0.35, minlen = 2))
The scatter plot illustrates the relationship between support and confidence for all association rules, with color intensity indicating the lift measure.
plot(rules, measure = c("support", "confidence"), shading = "lift")
Most rules show low support but moderate to high confidence, suggesting reliable patterns within specific subgroups rather than the entire population. Rules with higher lift values are concentrated in this region, indicating that the most informative associations are not necessarily the most frequent ones.
hist(quality(rules)$lift,
breaks = 30,
main = "Distribution of Lift Values",
xlab = "Lift",
col = "lightgray",
border = "white")
Most rules exhibit lift values slightly above one, indicating weak to moderate positive associations between antecedents and consequents. A smaller subset of rules reaches substantially higher lift values, representingstrong and potentially meaningful deviations from statistical independence.
To improve interpretability, rules were filtered to include only those where personality traits predict the use of a single substance. Redundant rules were removed.
rules_clean <- subset(
rules,
subset =
rhs %pin% "User_" &
size(rhs) == 1 &
!(lhs %pin% "User_")
)
rules_clean <- rules_clean[!is.redundant(rules_clean)]
rules_top <- head(sort(rules_clean, by = "lift"), 10)
inspect(rules_top)
## lhs rhs support confidence coverage lift count
## [1] {Oscore=High_Oscore,
## SS=High_SS} => {LSD=User_LSD} 0.06578249 0.4960000 0.1326260 2.460421 124
## [2] {Oscore=High_Oscore,
## SS=High_SS} => {Mushrooms=User_Mushrooms} 0.06896552 0.5200000 0.1326260 2.258525 130
## [3] {Oscore=High_Oscore,
## Cscore=Low_Cscore} => {LSD=User_LSD} 0.05411141 0.4322034 0.1251989 2.143956 102
## [4] {Cscore=Low_Cscore,
## SS=High_SS} => {LSD=User_LSD} 0.06100796 0.4307116 0.1416446 2.136556 115
## [5] {Escore=High_Escore,
## Oscore=High_Oscore} => {LSD=User_LSD} 0.05145889 0.4181034 0.1230769 2.074013 97
## [6] {Escore=High_Escore,
## Oscore=High_Oscore} => {Mushrooms=User_Mushrooms} 0.05782493 0.4698276 0.1230769 2.040611 109
## [7] {Oscore=High_Oscore,
## Cscore=Low_Cscore} => {Mushrooms=User_Mushrooms} 0.05835544 0.4661017 0.1251989 2.024428 110
## [8] {Escore=High_Escore,
## SS=High_SS} => {Ecstasy=User_Ecstasy} 0.05941645 0.5463415 0.1087533 1.991980 112
## [9] {Cscore=Low_Cscore,
## SS=High_SS} => {Coke=User_Coke} 0.06206897 0.4382022 0.1416446 1.980842 117
## [10] {Cscore=Low_Cscore,
## SS=High_SS} => {Ecstasy=User_Ecstasy} 0.07692308 0.5430712 0.1416446 1.980056 145
plot(
rules_top,
method = "graph",
measure = "support",
shading = "lift",
main = "Association Rules: Personality Traits → Drug Use"
)
## Available control parameters (with default values):
## layout = stress
## circular = FALSE
## ggraphdots = NULL
## edges = <environment>
## nodes = <environment>
## nodetext = <environment>
## colors = c("#EE0000FF", "#EEEEEEFF")
## engine = ggplot2
## max = 100
## verbose = FALSE
The network graph visualizes the strongest association rules by representing personality traits and substances as nodes, with edge direction indicating rule flow from antecedents to consequents. Node size reflects support, while color intensity corresponds to lift, highlighting particularly strong associations involving high sensation seeking and high openness to experience.
The grouped matrix visualization summarizes the strongest association rules by organizing them according to shared antecedent itemsets on the left-hand side and their corresponding consequents on the right-hand side. Dot size represents support, while color intensity indicates lift.
plot(
rules_top,
method = "grouped",
control = list(type = "items")
)
## Available control parameters (with default values):
## k = 20
## aggr.fun = function (x, ...) UseMethod("mean")
## rhs_max = 10
## lhs_label_items = 2
## col = c("#EE0000FF", "#EEEEEEFF")
## groups = NULL
## engine = ggplot2
## verbose = FALSE
plot(
rules_top,
method = "paracoord",
control = list(reorder = TRUE)
)
The strongest rules consistently involve high sensation seeking and high openness to experience predicting the use of psychedelic and stimulant substances such as LSD, mushrooms, ecstasy, and cocaine. This pattern aligns with established psychological theories linking novelty-seeking traits to experimentation with psychoactive substances.
This project demonstrates that Association Rule Mining can successfully uncover meaningful, non-linear relationships between personality traits and drug consumption. The applied preprocessing and filtering steps ensured both methodological rigor and interpretability of results.