Association rule learning is a “method for discovering interesting relations between variables in large databases” (https://en.wikipedia.org/wiki/Association_rule_learning).
It is one of the most important unsupervised learning techniques, used mainly for market basket analysis and for pattern detection in health care data, economic data or consumer behavior. For this project I decided to analyse an interesting dataset about Portuguese secondary school students and their alcohol consumption. Using two association rule methods, the Apriori algorithm and Equivalence Class Clustering and bottom-up Lattice Traversal (ECLAT), I look for relations between the variables describing the students and their alcohol consumption habits.
I start by loading the necessary libraries and the dataset. I use only a subset of the data: the variables that I find most interesting and that may exhibit some patterns. The variables used are:
- student’s sex (binary: ‘F’ - female or ‘M’ - male),
- family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3),
- mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’),
- father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’),
- reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’),
- workday alcohol consumption (numeric: from 1 - very low to 5 - very high).
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following object is masked from 'package:dplyr':
##
## recode
## The following objects are masked from 'package:base':
##
## abbreviate, write
library(arulesViz)
library(arulesCBA)
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
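# Keep only the variables of interest
selected_columns <- c("sex", "famsize", "Mjob", "Fjob", "reason", "Dalc")
dane <- read.csv("student-mat.csv", sep=",", header = TRUE, na.strings = "")[, selected_columns]
df <- data.frame(dane)
# Treat the 1-5 alcohol consumption scale as categorical labels rather than numbers
df$Dalc <- as.character(df$Dalc)
# Drop rows with missing values
students <- na.omit(df)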
cat("Number of observations in the dataset:", nrow(students))
## Number of observations in the dataset: 395
The arules functions work on objects of the transactions class, so I convert the dataset and inspect a few of the observations. I also take a look at how frequently each variable value occurs in the dataset.
dt <- data.table(students)
trans <- as(dt, 'transactions')
## Warning: Column(s) 1, 2, 3, 4, 5, 6 not logical or factor. Applying default
## discretization (see '? discretizeDF').
summary(trans)
## transactions as itemMatrix in sparse format with
## 395 rows (elements/itemsets/transactions) and
## 23 columns (items) and a density of 0.2608696
##
## most frequent items:
## famsize=GT3 Dalc=1 Fjob=other sex=F sex=M (Other)
## 281 276 217 208 187 1201
##
## element (itemset/transaction) length distribution:
## sizes
## 6
## 395
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6 6 6 6 6 6
##
## includes extended item information - examples:
## labels variables levels
## 1 sex=F sex F
## 2 sex=M sex M
## 3 famsize=GT3 famsize GT3
##
## includes extended transaction information - examples:
## transactionID
## 1 1
## 2 2
## 3 3
inspect(trans[1:5])
## items transactionID
## [1] {sex=F,
## famsize=GT3,
## Mjob=at_home,
## Fjob=teacher,
## reason=course,
## Dalc=1} 1
## [2] {sex=F,
## famsize=GT3,
## Mjob=at_home,
## Fjob=other,
## reason=course,
## Dalc=1} 2
## [3] {sex=F,
## famsize=LE3,
## Mjob=at_home,
## Fjob=other,
## reason=other,
## Dalc=2} 3
## [4] {sex=F,
## famsize=GT3,
## Mjob=health,
## Fjob=services,
## reason=home,
## Dalc=1} 4
## [5] {sex=F,
## famsize=GT3,
## Mjob=other,
## Fjob=other,
## reason=home,
## Dalc=1} 5
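To make the transactions format more concrete, here is a minimal base R sketch of the same idea: every "variable=value" pair becomes an item, and each student becomes a set of items. The data frame `students_toy` and the helper `to_items` are hypothetical names used only for illustration; this is not how arules stores the data internally (it uses a sparse matrix).

```r
# Toy data: three students and three of the variables (hypothetical subset)
students_toy <- data.frame(
  sex = c("F", "F", "F"),
  famsize = c("GT3", "GT3", "LE3"),
  Dalc = c("1", "1", "2"),
  stringsAsFactors = FALSE
)

# Turn one row into item labels of the form "variable=value"
to_items <- function(row) paste0(names(row), "=", row)
items_per_row <- lapply(seq_len(nrow(students_toy)), function(i) {
  to_items(unlist(students_toy[i, ]))
})

# Binary incidence matrix: one row per student, one column per distinct item
all_items <- sort(unique(unlist(items_per_row)))
incidence <- t(vapply(items_per_row,
                      function(its) all_items %in% its,
                      logical(length(all_items))))
colnames(incidence) <- all_items

# Density is the share of TRUE cells in the matrix
density <- mean(incidence)
print(incidence)
print(density)
```

In the real dataset every student contributes exactly 6 items out of 23 possible ones, which is where the density of 6 / 23 ≈ 0.26 in the summary comes from.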
itemFrequency(trans, type = "absolute")
## sex=F sex=M famsize=GT3 famsize=LE3
## 208 187 281 114
## Mjob=at_home Mjob=health Mjob=other Mjob=services
## 59 34 141 103
## Mjob=teacher Fjob=at_home Fjob=health Fjob=other
## 58 20 18 217
## Fjob=services Fjob=teacher reason=course reason=home
## 111 29 145 109
## reason=other reason=reputation Dalc=1 Dalc=2
## 36 105 276 75
## Dalc=3 Dalc=4 Dalc=5
## 26 9 9
itemFrequency(trans, type = "relative")
## sex=F sex=M famsize=GT3 famsize=LE3
## 0.52658228 0.47341772 0.71139241 0.28860759
## Mjob=at_home Mjob=health Mjob=other Mjob=services
## 0.14936709 0.08607595 0.35696203 0.26075949
## Mjob=teacher Fjob=at_home Fjob=health Fjob=other
## 0.14683544 0.05063291 0.04556962 0.54936709
## Fjob=services Fjob=teacher reason=course reason=home
## 0.28101266 0.07341772 0.36708861 0.27594937
## reason=other reason=reputation Dalc=1 Dalc=2
## 0.09113924 0.26582278 0.69873418 0.18987342
## Dalc=3 Dalc=4 Dalc=5
## 0.06582278 0.02278481 0.02278481
itemFrequencyPlot(trans, topN=10, type="relative", main="ItemFrequency")
The plot shows that most students come from families of more than three members (about 71%) and report very low workday alcohol consumption (about 70%). The numbers of male and female students in the study are more or less the same.
set.seed(240)
image(sample(trans,50))
Any discussion of association rules has to mention the three measures used to evaluate the rules obtained: support, confidence and lift.
Support refers to how often a particular itemset or rule appears in the data. It can be thought of as the relative frequency of the item or rule.
Confidence is the proportion of transactions containing a given item (or itemset) that also contain another item (or itemset); for a rule X => Y it estimates the probability of Y given X. A higher confidence value indicates a stronger rule.
Lift measures how much more likely item X is to appear in a transaction when item Y is present, compared with the probability of X appearing without any knowledge of Y’s presence. A lift above 1 indicates a positive association, a lift of 1 indicates independence, and a lift below 1 indicates a negative association.
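As a worked example, the three measures can be computed by hand for the rule {sex=F} => {Dalc=1}. The counts below are copied from the outputs in this report (item frequencies and the rule's reported count); the variable names are my own and serve only as illustration.

```r
# Counts taken from the outputs in this report
n_total <- 395  # number of transactions (students)
n_lhs   <- 208  # students with sex=F
n_rhs   <- 276  # students with Dalc=1
n_both  <- 167  # students with both sex=F and Dalc=1

# Support: share of all transactions that contain both sides of the rule
support <- n_both / n_total
# Confidence: share of the sex=F transactions that also contain Dalc=1
confidence <- n_both / n_lhs
# Lift: confidence relative to the overall frequency of Dalc=1
lift <- confidence / (n_rhs / n_total)

print(round(c(support = support, confidence = confidence, lift = lift), 4))
```

The rounded results (0.4228, 0.8029 and 1.1491) agree with the values that inspect() reports for this rule.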
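The Apriori algorithm makes use of what we already know about frequent itemsets: every subset of a frequent itemset must itself be frequent, so any candidate containing an infrequent subset can be discarded early. Setting a minimum support threshold at the start therefore reduces the number of itemsets and rules we need to analyse; the higher the threshold, the fewer rules remain. The parameters should, however, be adjusted to the case and the data analysed.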
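The level-wise search with pruning can be sketched in a few lines of base R. This is a simplified illustration on hypothetical toy transactions, not the implementation used by arules; the names `transactions`, `support` and `all_frequent` are assumptions made for this sketch.

```r
# Toy transactions (hypothetical data, not the student dataset)
transactions <- list(
  c("A", "B", "C"),
  c("A", "B"),
  c("A", "C"),
  c("B", "C"),
  c("A", "B", "C", "D")
)
min_support <- 0.5

# Relative support: share of transactions that contain every item of the itemset
support <- function(itemset) {
  mean(vapply(transactions, function(t) all(itemset %in% t), logical(1)))
}

# Level 1: frequent single items
items <- sort(unique(unlist(transactions)))
frequent <- Filter(function(s) support(s) >= min_support, as.list(items))
all_frequent <- frequent

k <- 2
while (length(frequent) > 1) {
  # Build candidate k-itemsets from items that still occur in frequent sets
  pool <- sort(unique(unlist(frequent)))
  if (length(pool) < k) break
  candidates <- combn(pool, k, simplify = FALSE)
  # Pruning step: keep a candidate only if all its (k-1)-subsets are frequent
  keys <- vapply(frequent, paste, character(1), collapse = ",")
  candidates <- Filter(function(cand) {
    subsets <- combn(cand, k - 1, simplify = FALSE)
    all(vapply(subsets, function(s) paste(s, collapse = ",") %in% keys, logical(1)))
  }, candidates)
  # Counting step: keep candidates that reach the minimum support
  frequent <- Filter(function(s) support(s) >= min_support, candidates)
  all_frequent <- c(all_frequent, frequent)
  k <- k + 1
}

print(vapply(all_frequent, paste, character(1), collapse = ","))
```

Item D appears in only one of five transactions, so it is dropped at the first level and no larger itemset containing it is ever counted.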
rules<-apriori(trans, parameter=list(supp=0.2, conf=0.5))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.2 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 79
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[23 item(s), 395 transaction(s)] done [0.00s].
## sorting and recoding items ... [12 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [46 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
There are 46 rules found. Let’s take a look at some of the provided rules.
inspect(rules[5:15])
## lhs rhs support confidence coverage
## [1] {reason=reputation} => {Dalc=1} 0.2075949 0.7809524 0.2658228
## [2] {Fjob=services} => {famsize=GT3} 0.2025316 0.7207207 0.2810127
## [3] {Mjob=other} => {Fjob=other} 0.2632911 0.7375887 0.3569620
## [4] {Mjob=other} => {Dalc=1} 0.2531646 0.7092199 0.3569620
## [5] {Mjob=other} => {famsize=GT3} 0.2683544 0.7517730 0.3569620
## [6] {reason=course} => {Fjob=other} 0.2075949 0.5655172 0.3670886
## [7] {reason=course} => {Dalc=1} 0.2556962 0.6965517 0.3670886
## [8] {reason=course} => {famsize=GT3} 0.2556962 0.6965517 0.3670886
## [9] {sex=M} => {Fjob=other} 0.2683544 0.5668449 0.4734177
## [10] {sex=M} => {Dalc=1} 0.2759494 0.5828877 0.4734177
## [11] {sex=M} => {famsize=GT3} 0.3164557 0.6684492 0.4734177
## lift count
## [1] 1.1176674 82
## [2] 1.0131128 80
## [3] 1.3426153 104
## [4] 1.0150067 100
## [5] 1.0567628 106
## [6] 1.0293977 82
## [7] 0.9968766 101
## [8] 0.9791385 101
## [9] 1.0318145 106
## [10] 0.8342052 109
## [11] 0.9396350 125
plot(rules)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
inspect(sort(rules, by = "confidence")[1:5])
## lhs rhs support confidence coverage
## [1] {sex=F, Fjob=other} => {Dalc=1} 0.2405063 0.8558559 0.2810127
## [2] {sex=F, famsize=GT3} => {Dalc=1} 0.3265823 0.8269231 0.3949367
## [3] {sex=F} => {Dalc=1} 0.4227848 0.8028846 0.5265823
## [4] {Mjob=other, Dalc=1} => {famsize=GT3} 0.2000000 0.7900000 0.2531646
## [5] {reason=reputation} => {Dalc=1} 0.2075949 0.7809524 0.2658228
## lift count
## [1] 1.224866 95
## [2] 1.183459 129
## [3] 1.149056 167
## [4] 1.110498 79
## [5] 1.117667 82
inspect(sort(rules, by = "lift")[1:5])
## lhs rhs support confidence coverage
## [1] {famsize=GT3, Fjob=other} => {Mjob=other} 0.2000000 0.5163399 0.3873418
## [2] {famsize=GT3, Mjob=other} => {Fjob=other} 0.2000000 0.7452830 0.2683544
## [3] {Mjob=other} => {Fjob=other} 0.2632911 0.7375887 0.3569620
## [4] {sex=F, Fjob=other} => {Dalc=1} 0.2405063 0.8558559 0.2810127
## [5] {sex=F, famsize=GT3} => {Dalc=1} 0.3265823 0.8269231 0.3949367
## lift count
## [1] 1.446484 79
## [2] 1.356621 79
## [3] 1.342615 104
## [4] 1.224866 95
## [5] 1.183459 129
inspect(sort(rules, by = "support")[1:5])
## lhs rhs support confidence coverage lift count
## [1] {} => {famsize=GT3} 0.7113924 0.7113924 1.0000000 1.000000 281
## [2] {} => {Dalc=1} 0.6987342 0.6987342 1.0000000 1.000000 276
## [3] {} => {Fjob=other} 0.5493671 0.5493671 1.0000000 1.000000 217
## [4] {} => {sex=F} 0.5265823 0.5265823 1.0000000 1.000000 208
## [5] {Dalc=1} => {famsize=GT3} 0.5265823 0.7536232 0.6987342 1.059364 208
set.seed(240)
plot(rules, method="graph", measure="support", shading="lift")
plot(rules, method = "matrix", measure = "lift")
## Itemsets in Antecedent (LHS)
## [1] "{famsize=GT3,Mjob=other}" "{famsize=GT3,Fjob=other}"
## [3] "{Mjob=other}" "{sex=F,Fjob=other}"
## [5] "{reason=reputation}" "{Mjob=other,Dalc=1}"
## [7] "{famsize=GT3,Dalc=1}" "{Fjob=other,Dalc=1}"
## [9] "{Dalc=1}" "{sex=F,famsize=GT3}"
## [11] "{Mjob=other,Fjob=other}" "{sex=F,Dalc=1}"
## [13] "{sex=F}" "{famsize=GT3}"
## [15] "{sex=M,Dalc=1}" "{Fjob=services}"
## [17] "{reason=course}" "{Fjob=other}"
## [19] "{}" "{sex=M}"
## [21] "{sex=M,famsize=GT3}"
## Itemsets in Consequent (RHS)
## [1] "{famsize=GT3}" "{Dalc=1}" "{Fjob=other}" "{sex=F}"
## [5] "{Mjob=other}"
plot(rules, method = "grouped")
plot(rules, method="paracoord", control=list(reorder=TRUE))
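No rules about high alcohol consumption were found: the items Dalc=3, Dalc=4 and Dalc=5 have supports of about 0.07, 0.02 and 0.02, far below the minimum support of 0.2.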
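The ECLAT algorithm doesn’t generate rules directly. It mines frequent itemsets using a vertical representation of the data: for each item it keeps the set of transactions containing that item, and it intersects these sets to obtain the support of larger itemsets. This gives us the frequent sets together with their corresponding measures. The length of the sets can be restricted to a given number (here through maxlen).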
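The core step, computing support by intersecting transaction-ID lists, can be sketched in base R as follows. The toy transactions and the names `tidlists`, `tids_AB` and `support_AB` are hypothetical and only illustrate the idea; the actual eclat() implementation is more elaborate.

```r
# Toy transactions (hypothetical data, not the student dataset)
transactions <- list(
  c("A", "B", "C"),
  c("A", "B"),
  c("A", "C"),
  c("B", "C"),
  c("A", "B", "C", "D")
)

# Vertical representation: for each item, the IDs of the transactions containing it
items <- sort(unique(unlist(transactions)))
tidlists <- lapply(setNames(items, items), function(item) {
  unname(which(vapply(transactions, function(t) item %in% t, logical(1))))
})

# The support of {A, B} follows from intersecting the two ID lists
tids_AB <- intersect(tidlists[["A"]], tidlists[["B"]])
support_AB <- length(tids_AB) / length(transactions)

print(tidlists)
print(tids_AB)
print(support_AB)
```

Larger itemsets are handled the same way: the ID list of {A, B, C} is the intersection of the list for {A, B} with the list for C, so the data never has to be rescanned.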
eclat_students <- eclat(trans, parameter = list(supp = 0.2, maxlen = 15))
## Eclat
##
## parameter specification:
## tidLists support minlen maxlen target ext
## FALSE 0.2 1 15 frequent itemsets TRUE
##
## algorithmic control:
## sparse sort verbose
## 7 -2 TRUE
##
## Absolute minimum support count: 79
##
## create itemset ...
## set transactions ...[23 item(s), 395 transaction(s)] done [0.00s].
## sorting and recoding items ... [12 item(s)] done [0.00s].
## creating bit matrix ... [12 row(s), 395 column(s)] done [0.00s].
## writing ... [36 set(s)] done [0.00s].
## Creating S4 object ... done [0.00s].
inspect(eclat_students)
## items support count
## [1] {reason=reputation, Dalc=1} 0.2075949 82
## [2] {famsize=GT3, Fjob=services} 0.2025316 80
## [3] {famsize=GT3, Mjob=other, Fjob=other} 0.2000000 79
## [4] {famsize=GT3, Mjob=other, Dalc=1} 0.2000000 79
## [5] {famsize=GT3, Mjob=other} 0.2683544 106
## [6] {Mjob=other, Dalc=1} 0.2531646 100
## [7] {Mjob=other, Fjob=other} 0.2632911 104
## [8] {famsize=GT3, reason=course} 0.2556962 101
## [9] {reason=course, Dalc=1} 0.2556962 101
## [10] {Fjob=other, reason=course} 0.2075949 82
## [11] {sex=M, famsize=GT3, Dalc=1} 0.2000000 79
## [12] {sex=M, famsize=GT3} 0.3164557 125
## [13] {sex=M, Dalc=1} 0.2759494 109
## [14] {sex=M, Fjob=other} 0.2683544 106
## [15] {sex=F, famsize=GT3, Fjob=other} 0.2101266 83
## [16] {sex=F, Fjob=other, Dalc=1} 0.2405063 95
## [17] {sex=F, famsize=GT3, Dalc=1} 0.3265823 129
## [18] {sex=F, famsize=GT3} 0.3949367 156
## [19] {sex=F, Dalc=1} 0.4227848 167
## [20] {sex=F, Fjob=other} 0.2810127 111
## [21] {famsize=GT3, Fjob=other, Dalc=1} 0.3012658 119
## [22] {famsize=GT3, Fjob=other} 0.3873418 153
## [23] {Fjob=other, Dalc=1} 0.4000000 158
## [24] {famsize=GT3, Dalc=1} 0.5265823 208
## [25] {famsize=GT3} 0.7113924 281
## [26] {Dalc=1} 0.6987342 276
## [27] {Fjob=other} 0.5493671 217
## [28] {sex=F} 0.5265823 208
## [29] {sex=M} 0.4734177 187
## [30] {reason=course} 0.3670886 145
## [31] {Mjob=other} 0.3569620 141
## [32] {famsize=LE3} 0.2886076 114
## [33] {Fjob=services} 0.2810127 111
## [34] {reason=home} 0.2759494 109
## [35] {reason=reputation} 0.2658228 105
## [36] {Mjob=services} 0.2607595 103
Next, I induce rules from the frequent itemsets with a minimum confidence of 0.5, and then display and inspect them.
rules_eclat <- ruleInduction(eclat_students, trans, confidence = 0.5)
rules_eclat
## set of 42 rules
inspect(rules_eclat[1:10])
## lhs rhs support confidence lift
## [1] {reason=reputation} => {Dalc=1} 0.2075949 0.7809524 1.117667
## [2] {Fjob=services} => {famsize=GT3} 0.2025316 0.7207207 1.013113
## [3] {Mjob=other, Fjob=other} => {famsize=GT3} 0.2000000 0.7596154 1.067787
## [4] {famsize=GT3, Fjob=other} => {Mjob=other} 0.2000000 0.5163399 1.446484
## [5] {famsize=GT3, Mjob=other} => {Fjob=other} 0.2000000 0.7452830 1.356621
## [6] {Mjob=other, Dalc=1} => {famsize=GT3} 0.2000000 0.7900000 1.110498
## [7] {famsize=GT3, Mjob=other} => {Dalc=1} 0.2000000 0.7452830 1.066619
## [8] {Mjob=other} => {famsize=GT3} 0.2683544 0.7517730 1.056763
## [9] {Mjob=other} => {Dalc=1} 0.2531646 0.7092199 1.015007
## [10] {Mjob=other} => {Fjob=other} 0.2632911 0.7375887 1.342615
## itemset
## [1] 1
## [2] 2
## [3] 3
## [4] 3
## [5] 3
## [6] 4
## [7] 4
## [8] 5
## [9] 6
## [10] 7
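To see where such rules come from, the induction step can be reproduced by hand for the frequent itemset {Mjob=other, Fjob=other}. The counts are copied from the ECLAT output in this report; the data frame layout and variable names are my own, and the sketch only mirrors the idea behind ruleInduction(), not its implementation.

```r
# Counts taken from the ECLAT output in this report
n_total      <- 395
n_mjob_other <- 141  # count of {Mjob=other}
n_fjob_other <- 217  # count of {Fjob=other}
n_both       <- 104  # count of {Mjob=other, Fjob=other}
min_confidence <- 0.5

# A two-item set yields two candidate rules, one per choice of right-hand side
candidates <- data.frame(
  lhs = c("{Mjob=other}", "{Fjob=other}"),
  rhs = c("{Fjob=other}", "{Mjob=other}"),
  confidence = c(n_both / n_mjob_other, n_both / n_fjob_other)
)
# Lift is symmetric, so it is the same for both directions
candidates$lift <- n_both * n_total / (n_mjob_other * n_fjob_other)

# Keep only the candidates that reach the confidence threshold
kept <- candidates[candidates$confidence >= min_confidence, ]
print(kept)
```

Only {Mjob=other} => {Fjob=other} survives (confidence about 0.74); the reverse direction has a confidence of about 0.48 and is dropped, which is why it does not appear among the rules.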
plot(rules_eclat)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
plot(rules_eclat, method = "grouped")
plot(rules_eclat, method="paracoord", control=list(reorder=TRUE))
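Finally, I identify redundant rules. By default, is.redundant() treats a rule as redundant if a more general rule (the same right-hand side with one or more items removed from the left-hand side) has the same or a higher confidence.
redun <- rules_eclat[is.redundant(rules_eclat)]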
inspect(redun)
## lhs rhs support confidence lift
## [1] {sex=M, Dalc=1} => {famsize=GT3} 0.2000000 0.7247706 1.0188057
## [2] {sex=M, famsize=GT3} => {Dalc=1} 0.2000000 0.6320000 0.9044928
## [3] {famsize=GT3, Fjob=other} => {sex=F} 0.2101266 0.5424837 1.0301973
## [4] {sex=F, Fjob=other} => {famsize=GT3} 0.2101266 0.7477477 1.0511045
## [5] {sex=F, famsize=GT3} => {Fjob=other} 0.2101266 0.5320513 0.9684804
## [6] {Fjob=other, Dalc=1} => {sex=F} 0.2405063 0.6012658 1.1418269
## [7] {sex=F, Dalc=1} => {Fjob=other} 0.2405063 0.5688623 1.0354866
## [8] {Fjob=other, Dalc=1} => {famsize=GT3} 0.3012658 0.7531646 1.0587189
## [9] {famsize=GT3, Dalc=1} => {Fjob=other} 0.3012658 0.5721154 1.0414082
## itemset
## [1] 11
## [2] 11
## [3] 15
## [4] 15
## [5] 15
## [6] 16
## [7] 16
## [8] 21
## [9] 21
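The first redundant rule can be checked by hand. The sketch below assumes the default, confidence-based definition of redundancy and copies the confidence values from the outputs in this report; the variable names are hypothetical.

```r
# Confidence of the specific rule {sex=M, Dalc=1} => {famsize=GT3}
conf_specific <- 0.7247706

# Confidences of more general rules with the same right-hand side
conf_general <- c(
  "{Dalc=1} => {famsize=GT3}" = 0.7536232,
  "{sex=M} => {famsize=GT3}"  = 0.6684492
)

# The specific rule is redundant if any more general rule is at least as confident
dominating <- names(conf_general)[conf_general >= conf_specific]
is_redundant <- length(dominating) > 0

print(is_redundant)
print(dominating)
```

Here {Dalc=1} => {famsize=GT3} already predicts a large family with higher confidence, so adding sex=M to the left-hand side contributes nothing.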
The association rule techniques allowed me to find relationships between the values of the variables in the dataset. However, no significant patterns regarding students’ alcohol consumption were detected.