Introduction

Association rule learning is a “method for discovering interesting relations between variables in large databases” (https://en.wikipedia.org/wiki/Association_rule_learning).

It is a widely used unsupervised learning technique, applied mainly to market basket analysis and to pattern detection in health care, economic and consumer behavior data. For this project I analyse a dataset on Portuguese secondary school students and their alcohol consumption. Using two association rule methods, the Apriori algorithm and Equivalence Class Clustering and bottom-up Lattice Traversal (ECLAT), I look for relations between the variables describing the students and their alcohol consumption habits.

Students Dataset

I start by loading the necessary libraries and the dataset. I use only a subset of the variables: those I find most interesting and that may exhibit patterns. The variables used are:

- student’s sex (binary: ‘F’ - female or ‘M’ - male)
- family size (binary: ‘LE3’ - less than or equal to 3 or ‘GT3’ - greater than 3)
- mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
- father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
- reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
- workday alcohol consumption (numeric: from 1 - very low to 5 - very high)

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(arules)
## Loading required package: Matrix
## 
## Attaching package: 'arules'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
library(arulesViz)
library(arulesCBA)
library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
selected_columns <- c("sex", "famsize", "Mjob", "Fjob", "reason", "Dalc")
dane <- read.csv("student-mat.csv", sep=",", header = TRUE, na.strings = "")[, selected_columns]
df <- data.frame(dane)

df$Dalc <- as.character(df$Dalc)

students <- na.omit(df)
cat("Number of observations in the dataset:", nrow(students))
## Number of observations in the dataset: 395

To apply association rule techniques in R with the arules package, the dataset must be an object of the transactions class. I convert the dataset, inspect a few of the observations, and look at how frequently each variable value occurs.

dt <- data.table(students)
trans <- as(dt, 'transactions')
## Warning: Column(s) 1, 2, 3, 4, 5, 6 not logical or factor. Applying default
## discretization (see '? discretizeDF').
summary(trans)
## transactions as itemMatrix in sparse format with
##  395 rows (elements/itemsets/transactions) and
##  23 columns (items) and a density of 0.2608696 
## 
## most frequent items:
## famsize=GT3      Dalc=1  Fjob=other       sex=F       sex=M     (Other) 
##         281         276         217         208         187        1201 
## 
## element (itemset/transaction) length distribution:
## sizes
##   6 
## 395 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       6       6       6       6       6       6 
## 
## includes extended item information - examples:
##        labels variables levels
## 1       sex=F       sex      F
## 2       sex=M       sex      M
## 3 famsize=GT3   famsize    GT3
## 
## includes extended transaction information - examples:
##   transactionID
## 1             1
## 2             2
## 3             3
inspect(trans[1:5])
##     items            transactionID
## [1] {sex=F,                       
##      famsize=GT3,                 
##      Mjob=at_home,                
##      Fjob=teacher,                
##      reason=course,               
##      Dalc=1}                     1
## [2] {sex=F,                       
##      famsize=GT3,                 
##      Mjob=at_home,                
##      Fjob=other,                  
##      reason=course,               
##      Dalc=1}                     2
## [3] {sex=F,                       
##      famsize=LE3,                 
##      Mjob=at_home,                
##      Fjob=other,                  
##      reason=other,                
##      Dalc=2}                     3
## [4] {sex=F,                       
##      famsize=GT3,                 
##      Mjob=health,                 
##      Fjob=services,               
##      reason=home,                 
##      Dalc=1}                     4
## [5] {sex=F,                       
##      famsize=GT3,                 
##      Mjob=other,                  
##      Fjob=other,                  
##      reason=home,                 
##      Dalc=1}                     5
itemFrequency(trans, type = "absolute")
##             sex=F             sex=M       famsize=GT3       famsize=LE3 
##               208               187               281               114 
##      Mjob=at_home       Mjob=health        Mjob=other     Mjob=services 
##                59                34               141               103 
##      Mjob=teacher      Fjob=at_home       Fjob=health        Fjob=other 
##                58                20                18               217 
##     Fjob=services      Fjob=teacher     reason=course       reason=home 
##               111                29               145               109 
##      reason=other reason=reputation            Dalc=1            Dalc=2 
##                36               105               276                75 
##            Dalc=3            Dalc=4            Dalc=5 
##                26                 9                 9
itemFrequency(trans, type = "relative")
##             sex=F             sex=M       famsize=GT3       famsize=LE3 
##        0.52658228        0.47341772        0.71139241        0.28860759 
##      Mjob=at_home       Mjob=health        Mjob=other     Mjob=services 
##        0.14936709        0.08607595        0.35696203        0.26075949 
##      Mjob=teacher      Fjob=at_home       Fjob=health        Fjob=other 
##        0.14683544        0.05063291        0.04556962        0.54936709 
##     Fjob=services      Fjob=teacher     reason=course       reason=home 
##        0.28101266        0.07341772        0.36708861        0.27594937 
##      reason=other reason=reputation            Dalc=1            Dalc=2 
##        0.09113924        0.26582278        0.69873418        0.18987342 
##            Dalc=3            Dalc=4            Dalc=5 
##        0.06582278        0.02278481        0.02278481
itemFrequencyPlot(trans, topN=10, type="relative", main="ItemFrequency") 

Many of the students come from large families (more than three members) and report very low workday alcohol consumption. The numbers of male and female students in the study are roughly equal.

set.seed(240)
image(sample(trans,50))
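
The image is a picture of the binary item matrix behind the transactions object: one row per student, one column per "variable=level" item. The following is a minimal base R sketch of that encoding, not the arules implementation. The `toy` table and the `item_matrix` name are made up for illustration.

```r
# Toy table in the same shape as the students data (values are made up).
toy <- data.frame(
  sex     = c("F", "M", "F"),
  famsize = c("GT3", "LE3", "GT3"),
  Dalc    = c("1", "2", "1"),
  stringsAsFactors = TRUE
)

# Build one logical column per "variable=level" pair.
item_matrix <- do.call(cbind, lapply(names(toy), function(v) {
  lv <- levels(toy[[v]])
  m  <- sapply(lv, function(l) toy[[v]] == l)
  colnames(m) <- paste(v, lv, sep = "=")
  m
}))

item_matrix

# Density is the share of TRUE cells: 3 items per row out of 6 columns here.
mean(item_matrix)

# Same arithmetic for the real data: 6 items per row out of 23 columns,
# which agrees with the density shown by summary(trans).
6 / 23
```

Because every student has exactly one value for each of the six variables, every transaction has length 6, which is why the length distribution in the summary is constant.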

Association Rules

Three measures are commonly used to evaluate the rules obtained: support, confidence and lift.

Support is how often a particular itemset or rule appears in the data, expressed as a relative frequency: the share of transactions that contain all of its items.

Confidence is the proportion of transactions containing the antecedent (left-hand side) that also contain the consequent (right-hand side). A higher confidence value means the rule holds more often when its antecedent is present.

Lift is the ratio of the rule’s confidence to the support of its consequent. It measures how much more likely the consequent is when the antecedent is present than it is overall: a lift above 1 indicates a positive association, a lift of 1 indicates independence, and a lift below 1 indicates a negative association.
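
As a worked example, the three measures can be computed by hand for the rule {sex=F} => {Dalc=1}. The counts are copied from the outputs in this report (395 students, 208 with sex=F, 276 with Dalc=1, 167 with both). The variable names are only illustrative.

```r
n      <- 395  # number of transactions
n_lhs  <- 208  # transactions containing sex=F
n_rhs  <- 276  # transactions containing Dalc=1
n_both <- 167  # transactions containing both items

# support: share of all transactions that contain both sides of the rule
support <- n_both / n

# confidence: share of antecedent transactions that also contain the consequent
confidence <- n_both / n_lhs

# lift: confidence relative to the overall frequency of the consequent
lift <- confidence / (n_rhs / n)

round(c(support = support, confidence = confidence, lift = lift), 7)
```

The results agree with the values arules reports for this rule (support 0.4227848, confidence 0.8028846, lift 1.149056).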

Apriori Algorithm

The Apriori algorithm exploits the fact that every subset of a frequent itemset must itself be frequent, so itemsets containing an infrequent item can be discarded early. Setting a minimum support threshold at the start therefore reduces the number of rules to analyse: the higher the threshold, the fewer rules remain. The thresholds should be adjusted to the case and data being analysed.

rules<-apriori(trans, parameter=list(supp=0.2, conf=0.5)) 
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5     0.2      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 79 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[23 item(s), 395 transaction(s)] done [0.00s].
## sorting and recoding items ... [12 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [46 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
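
The log line `sorting and recoding items ... [12 item(s)]` can be reproduced by hand. The sketch below is not the arules code: it only applies the reported minimum support count of 79 to the absolute item counts copied from the `itemFrequency()` output. The names `item_counts`, `frequent` and `pruned` are illustrative.

```r
# Absolute item counts copied from itemFrequency(trans, type = "absolute").
item_counts <- c(
  "sex=F" = 208, "sex=M" = 187, "famsize=GT3" = 281, "famsize=LE3" = 114,
  "Mjob=at_home" = 59, "Mjob=health" = 34, "Mjob=other" = 141,
  "Mjob=services" = 103, "Mjob=teacher" = 58, "Fjob=at_home" = 20,
  "Fjob=health" = 18, "Fjob=other" = 217, "Fjob=services" = 111,
  "Fjob=teacher" = 29, "reason=course" = 145, "reason=home" = 109,
  "reason=other" = 36, "reason=reputation" = 105, "Dalc=1" = 276,
  "Dalc=2" = 75, "Dalc=3" = 26, "Dalc=4" = 9, "Dalc=5" = 9
)

# "Absolute minimum support count" reported by apriori() for supp = 0.2.
min_count <- 79

# Items below the threshold are dropped before any larger itemset is built.
frequent <- names(item_counts)[item_counts >= min_count]
pruned   <- setdiff(names(item_counts), frequent)

length(frequent)
pruned
```

Twelve items pass the threshold. All items for alcohol consumption levels 2 to 5 are pruned at this first step, so no itemset or rule containing them can be generated with these settings.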

The algorithm found 46 rules. Let’s take a look at some of them.

inspect(rules[5:15])
##      lhs                    rhs           support   confidence coverage 
## [1]  {reason=reputation} => {Dalc=1}      0.2075949 0.7809524  0.2658228
## [2]  {Fjob=services}     => {famsize=GT3} 0.2025316 0.7207207  0.2810127
## [3]  {Mjob=other}        => {Fjob=other}  0.2632911 0.7375887  0.3569620
## [4]  {Mjob=other}        => {Dalc=1}      0.2531646 0.7092199  0.3569620
## [5]  {Mjob=other}        => {famsize=GT3} 0.2683544 0.7517730  0.3569620
## [6]  {reason=course}     => {Fjob=other}  0.2075949 0.5655172  0.3670886
## [7]  {reason=course}     => {Dalc=1}      0.2556962 0.6965517  0.3670886
## [8]  {reason=course}     => {famsize=GT3} 0.2556962 0.6965517  0.3670886
## [9]  {sex=M}             => {Fjob=other}  0.2683544 0.5668449  0.4734177
## [10] {sex=M}             => {Dalc=1}      0.2759494 0.5828877  0.4734177
## [11] {sex=M}             => {famsize=GT3} 0.3164557 0.6684492  0.4734177
##      lift      count
## [1]  1.1176674  82  
## [2]  1.0131128  80  
## [3]  1.3426153 104  
## [4]  1.0150067 100  
## [5]  1.0567628 106  
## [6]  1.0293977  82  
## [7]  0.9968766 101  
## [8]  0.9791385 101  
## [9]  1.0318145 106  
## [10] 0.8342052 109  
## [11] 0.9396350 125
plot(rules)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

inspect(sort(rules, by = "confidence")[1:5])
##     lhs                     rhs           support   confidence coverage 
## [1] {sex=F, Fjob=other}  => {Dalc=1}      0.2405063 0.8558559  0.2810127
## [2] {sex=F, famsize=GT3} => {Dalc=1}      0.3265823 0.8269231  0.3949367
## [3] {sex=F}              => {Dalc=1}      0.4227848 0.8028846  0.5265823
## [4] {Mjob=other, Dalc=1} => {famsize=GT3} 0.2000000 0.7900000  0.2531646
## [5] {reason=reputation}  => {Dalc=1}      0.2075949 0.7809524  0.2658228
##     lift     count
## [1] 1.224866  95  
## [2] 1.183459 129  
## [3] 1.149056 167  
## [4] 1.110498  79  
## [5] 1.117667  82
inspect(sort(rules, by = "lift")[1:5])
##     lhs                          rhs          support   confidence coverage 
## [1] {famsize=GT3, Fjob=other} => {Mjob=other} 0.2000000 0.5163399  0.3873418
## [2] {famsize=GT3, Mjob=other} => {Fjob=other} 0.2000000 0.7452830  0.2683544
## [3] {Mjob=other}              => {Fjob=other} 0.2632911 0.7375887  0.3569620
## [4] {sex=F, Fjob=other}       => {Dalc=1}     0.2405063 0.8558559  0.2810127
## [5] {sex=F, famsize=GT3}      => {Dalc=1}     0.3265823 0.8269231  0.3949367
##     lift     count
## [1] 1.446484  79  
## [2] 1.356621  79  
## [3] 1.342615 104  
## [4] 1.224866  95  
## [5] 1.183459 129
inspect(sort(rules, by = "support")[1:5])
##     lhs         rhs           support   confidence coverage  lift     count
## [1] {}       => {famsize=GT3} 0.7113924 0.7113924  1.0000000 1.000000 281  
## [2] {}       => {Dalc=1}      0.6987342 0.6987342  1.0000000 1.000000 276  
## [3] {}       => {Fjob=other}  0.5493671 0.5493671  1.0000000 1.000000 217  
## [4] {}       => {sex=F}       0.5265823 0.5265823  1.0000000 1.000000 208  
## [5] {Dalc=1} => {famsize=GT3} 0.5265823 0.7536232  0.6987342 1.059364 208
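
The top rules by support have an empty left-hand side. Such a rule only restates how frequent an item is: its confidence equals the item's support and its lift is 1. A small sketch with the relative frequencies copied from the `itemFrequency()` output shows which items produce these rules at a confidence threshold of 0.5 (the object names are illustrative).

```r
# Relative frequencies of the items with support of at least 0.2,
# copied from itemFrequency(trans, type = "relative").
item_support <- c(
  "famsize=GT3" = 0.7113924, "Dalc=1" = 0.6987342, "Fjob=other" = 0.5493671,
  "sex=F" = 0.5265823, "sex=M" = 0.4734177, "reason=course" = 0.3670886,
  "Mjob=other" = 0.3569620, "famsize=LE3" = 0.2886076,
  "Fjob=services" = 0.2810127, "reason=home" = 0.2759494,
  "reason=reputation" = 0.2658228, "Mjob=services" = 0.2607595
)

min_conf <- 0.5

# For {} => {item}: confidence = support of the item, lift = 1.
empty_lhs <- data.frame(
  rhs        = names(item_support),
  support    = unname(item_support),
  confidence = unname(item_support),
  lift       = 1,
  stringsAsFactors = FALSE
)
empty_lhs <- empty_lhs[empty_lhs$confidence >= min_conf, ]
empty_lhs
```

Only the four items that occur in at least half of the transactions qualify, which matches the four `{}` rules listed. In arules these rules can be excluded by adding `minlen = 2` to the `parameter` list of `apriori()`.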
set.seed(240) 
plot(rules, method="graph", measure="support", shading="lift")

plot(rules, method = "matrix", measure = "lift")
## Itemsets in Antecedent (LHS)
##  [1] "{famsize=GT3,Mjob=other}" "{famsize=GT3,Fjob=other}"
##  [3] "{Mjob=other}"             "{sex=F,Fjob=other}"      
##  [5] "{reason=reputation}"      "{Mjob=other,Dalc=1}"     
##  [7] "{famsize=GT3,Dalc=1}"     "{Fjob=other,Dalc=1}"     
##  [9] "{Dalc=1}"                 "{sex=F,famsize=GT3}"     
## [11] "{Mjob=other,Fjob=other}"  "{sex=F,Dalc=1}"          
## [13] "{sex=F}"                  "{famsize=GT3}"           
## [15] "{sex=M,Dalc=1}"           "{Fjob=services}"         
## [17] "{reason=course}"          "{Fjob=other}"            
## [19] "{}"                       "{sex=M}"                 
## [21] "{sex=M,famsize=GT3}"     
## Itemsets in Consequent (RHS)
## [1] "{famsize=GT3}" "{Dalc=1}"      "{Fjob=other}"  "{sex=F}"      
## [5] "{Mjob=other}"

plot(rules, method = "grouped")

plot(rules, method="paracoord", control=list(reorder=TRUE))

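No rules on high alcohol consumption were found. With a minimum support of 0.2, the items Dalc=4 and Dalc=5 (each with a relative frequency of about 0.02) are too rare to appear in any rule.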
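
One possible way to look for such rules is to lower the thresholds and restrict the right-hand side to the high-consumption items through the `appearance` argument of `apriori()`. The sketch below runs on a small synthetic table, not on the students data, and the thresholds are arbitrary assumptions chosen for the toy example. On the real data they would need tuning, and rules backed by only a handful of students should be read with caution.

```r
library(arules)

# Synthetic stand-in for the students table (made-up values).
toy <- data.frame(
  sex  = factor(rep(c("F", "M"), each = 100)),
  Dalc = factor(
    c(rep("1", 80), rep("2", 15), rep("3", 4), rep("4", 1),                # F
      rep("1", 60), rep("2", 20), rep("3", 8), rep("4", 6), rep("5", 6)),  # M
    levels = as.character(1:5)
  )
)
toy_trans <- as(toy, "transactions")

# Allow only Dalc=4 and Dalc=5 on the right-hand side; every other item
# may appear on the left-hand side only.
high_rules <- apriori(
  toy_trans,
  parameter  = list(supp = 0.02, conf = 0.05, minlen = 2),
  appearance = list(rhs = c("Dalc=4", "Dalc=5"), default = "lhs"),
  control    = list(verbose = FALSE)
)

inspect(sort(high_rules, by = "lift"))
```

With the transactions object from this report, the same call would take `trans` in place of `toy_trans`.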

ECLAT

The ECLAT algorithm does not generate rules directly. It mines frequent itemsets, working with the lists of transaction IDs in which each item occurs, and reports their support. The maximum length of an itemset can be restricted to a given number (here maxlen = 15).
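
The core idea behind ECLAT can be sketched in a few lines of base R: store, for every item, the IDs of the transactions that contain it, and obtain the support of an itemset by intersecting those ID lists. This is a simplified illustration on made-up transactions, not the implementation used by `eclat()`. The names `toy`, `tid_list` and `itemset_support` are hypothetical.

```r
# Toy transactions (made-up values): each element lists the items of one student.
toy <- list(
  c("sex=F", "famsize=GT3", "Dalc=1"),
  c("sex=F", "famsize=LE3", "Dalc=1"),
  c("sex=M", "famsize=GT3", "Dalc=2"),
  c("sex=F", "famsize=GT3", "Dalc=1")
)

# Vertical layout: for every item, the IDs of the transactions containing it.
items    <- sort(unique(unlist(toy)))
tid_list <- lapply(setNames(items, items), function(item) {
  which(vapply(toy, function(tr) item %in% tr, logical(1)))
})

# Support of an itemset = size of the intersection of its items' ID lists,
# divided by the number of transactions.
itemset_support <- function(itemset) {
  tids <- Reduce(intersect, tid_list[itemset])
  length(tids) / length(toy)
}

itemset_support(c("sex=F", "Dalc=1"))                 # 3 of 4 transactions
itemset_support(c("sex=F", "famsize=GT3", "Dalc=1"))  # 2 of 4 transactions
```

Adding an item to an itemset can only shrink the intersection, so support never increases as itemsets grow. This is the same property that Apriori uses for pruning.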

eclat_students <- eclat(trans, parameter = list(supp = 0.2, maxlen = 15))
## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target  ext
##     FALSE     0.2      1     15 frequent itemsets TRUE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 79 
## 
## create itemset ... 
## set transactions ...[23 item(s), 395 transaction(s)] done [0.00s].
## sorting and recoding items ... [12 item(s)] done [0.00s].
## creating bit matrix ... [12 row(s), 395 column(s)] done [0.00s].
## writing  ... [36 set(s)] done [0.00s].
## Creating S4 object  ... done [0.00s].
inspect(eclat_students)
##      items                                 support   count
## [1]  {reason=reputation, Dalc=1}           0.2075949  82  
## [2]  {famsize=GT3, Fjob=services}          0.2025316  80  
## [3]  {famsize=GT3, Mjob=other, Fjob=other} 0.2000000  79  
## [4]  {famsize=GT3, Mjob=other, Dalc=1}     0.2000000  79  
## [5]  {famsize=GT3, Mjob=other}             0.2683544 106  
## [6]  {Mjob=other, Dalc=1}                  0.2531646 100  
## [7]  {Mjob=other, Fjob=other}              0.2632911 104  
## [8]  {famsize=GT3, reason=course}          0.2556962 101  
## [9]  {reason=course, Dalc=1}               0.2556962 101  
## [10] {Fjob=other, reason=course}           0.2075949  82  
## [11] {sex=M, famsize=GT3, Dalc=1}          0.2000000  79  
## [12] {sex=M, famsize=GT3}                  0.3164557 125  
## [13] {sex=M, Dalc=1}                       0.2759494 109  
## [14] {sex=M, Fjob=other}                   0.2683544 106  
## [15] {sex=F, famsize=GT3, Fjob=other}      0.2101266  83  
## [16] {sex=F, Fjob=other, Dalc=1}           0.2405063  95  
## [17] {sex=F, famsize=GT3, Dalc=1}          0.3265823 129  
## [18] {sex=F, famsize=GT3}                  0.3949367 156  
## [19] {sex=F, Dalc=1}                       0.4227848 167  
## [20] {sex=F, Fjob=other}                   0.2810127 111  
## [21] {famsize=GT3, Fjob=other, Dalc=1}     0.3012658 119  
## [22] {famsize=GT3, Fjob=other}             0.3873418 153  
## [23] {Fjob=other, Dalc=1}                  0.4000000 158  
## [24] {famsize=GT3, Dalc=1}                 0.5265823 208  
## [25] {famsize=GT3}                         0.7113924 281  
## [26] {Dalc=1}                              0.6987342 276  
## [27] {Fjob=other}                          0.5493671 217  
## [28] {sex=F}                               0.5265823 208  
## [29] {sex=M}                               0.4734177 187  
## [30] {reason=course}                       0.3670886 145  
## [31] {Mjob=other}                          0.3569620 141  
## [32] {famsize=LE3}                         0.2886076 114  
## [33] {Fjob=services}                       0.2810127 111  
## [34] {reason=home}                         0.2759494 109  
## [35] {reason=reputation}                   0.2658228 105  
## [36] {Mjob=services}                       0.2607595 103

Next, I derive rules from the frequent itemsets with ruleInduction() and inspect them.

rules_eclat <- ruleInduction(eclat_students, trans, confidence = 0.5)
rules_eclat
## set of 42 rules
inspect(rules_eclat[1:10])
##      lhs                          rhs           support   confidence lift    
## [1]  {reason=reputation}       => {Dalc=1}      0.2075949 0.7809524  1.117667
## [2]  {Fjob=services}           => {famsize=GT3} 0.2025316 0.7207207  1.013113
## [3]  {Mjob=other, Fjob=other}  => {famsize=GT3} 0.2000000 0.7596154  1.067787
## [4]  {famsize=GT3, Fjob=other} => {Mjob=other}  0.2000000 0.5163399  1.446484
## [5]  {famsize=GT3, Mjob=other} => {Fjob=other}  0.2000000 0.7452830  1.356621
## [6]  {Mjob=other, Dalc=1}      => {famsize=GT3} 0.2000000 0.7900000  1.110498
## [7]  {famsize=GT3, Mjob=other} => {Dalc=1}      0.2000000 0.7452830  1.066619
## [8]  {Mjob=other}              => {famsize=GT3} 0.2683544 0.7517730  1.056763
## [9]  {Mjob=other}              => {Dalc=1}      0.2531646 0.7092199  1.015007
## [10] {Mjob=other}              => {Fjob=other}  0.2632911 0.7375887  1.342615
##      itemset
## [1]  1      
## [2]  2      
## [3]  3      
## [4]  3      
## [5]  3      
## [6]  4      
## [7]  4      
## [8]  5      
## [9]  6      
## [10] 7
plot(rules_eclat)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
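
The arithmetic behind rule induction can be checked by hand for one itemset. The sketch below takes itemset 3, {famsize=GT3, Mjob=other, Fjob=other}, moves each of its items to the right-hand side in turn, and computes confidence and lift from the counts in the `inspect(eclat_students)` output. It illustrates the calculation only, not the internals of `ruleInduction()`, and the object names are made up.

```r
n <- 395  # number of transactions

# Counts copied from the inspect(eclat_students) output.
count <- c(
  "famsize=GT3,Mjob=other,Fjob=other" = 79,
  "Mjob=other,Fjob=other"             = 104,
  "famsize=GT3,Fjob=other"            = 153,
  "famsize=GT3,Mjob=other"            = 106,
  "famsize=GT3" = 281, "Mjob=other" = 141, "Fjob=other" = 217
)

# Each row moves one item of the three-item set to the right-hand side.
candidates <- data.frame(
  lhs = c("Mjob=other,Fjob=other", "famsize=GT3,Fjob=other", "famsize=GT3,Mjob=other"),
  rhs = c("famsize=GT3", "Mjob=other", "Fjob=other"),
  stringsAsFactors = FALSE
)

itemset_count <- count[["famsize=GT3,Mjob=other,Fjob=other"]]

# confidence = count(itemset) / count(lhs); lift = confidence / support(rhs)
candidates$confidence <- itemset_count / unname(count[candidates$lhs])
candidates$lift       <- candidates$confidence / (unname(count[candidates$rhs]) / n)

# Keep the candidates that reach the confidence threshold of 0.5.
candidates[candidates$confidence >= 0.5, ]
```

All three candidates pass the threshold, and their values agree with rules [3], [4] and [5] in the listing.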

plot(rules_eclat, method = "grouped")

plot(rules_eclat, method="paracoord", control=list(reorder=TRUE))

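Finally, I identify redundant rules, i.e. rules for which a more general rule with the same consequent has the same or higher confidence.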

redun <- rules_eclat[is.redundant(rules_eclat)]
inspect(redun)
##     lhs                          rhs           support   confidence lift     
## [1] {sex=M, Dalc=1}           => {famsize=GT3} 0.2000000 0.7247706  1.0188057
## [2] {sex=M, famsize=GT3}      => {Dalc=1}      0.2000000 0.6320000  0.9044928
## [3] {famsize=GT3, Fjob=other} => {sex=F}       0.2101266 0.5424837  1.0301973
## [4] {sex=F, Fjob=other}       => {famsize=GT3} 0.2101266 0.7477477  1.0511045
## [5] {sex=F, famsize=GT3}      => {Fjob=other}  0.2101266 0.5320513  0.9684804
## [6] {Fjob=other, Dalc=1}      => {sex=F}       0.2405063 0.6012658  1.1418269
## [7] {sex=F, Dalc=1}           => {Fjob=other}  0.2405063 0.5688623  1.0354866
## [8] {Fjob=other, Dalc=1}      => {famsize=GT3} 0.3012658 0.7531646  1.0587189
## [9] {famsize=GT3, Dalc=1}     => {Fjob=other}  0.3012658 0.5721154  1.0414082
##     itemset
## [1] 11     
## [2] 11     
## [3] 15     
## [4] 15     
## [5] 15     
## [6] 16     
## [7] 16     
## [8] 21     
## [9] 21
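
To see why the first rule is flagged, it can be compared with its more general versions. The sketch below assumes the confidence-based criterion (a rule is redundant when a rule with the same right-hand side and a subset of its left-hand side has equal or higher confidence), which is my understanding of the default in `is.redundant()`. The confidences are taken from the rule listings in this report, and the object names are illustrative.

```r
# Rule under test: {sex=M, Dalc=1} => {famsize=GT3}
rule <- list(lhs = c("sex=M", "Dalc=1"), rhs = "famsize=GT3", confidence = 0.7247706)

# More general rules with the same right-hand side:
#   {sex=M}  => {famsize=GT3}  confidence 0.6684492
#   {Dalc=1} => {famsize=GT3}  confidence 0.7536232
general <- data.frame(
  lhs        = c("sex=M", "Dalc=1"),
  confidence = c(0.6684492, 0.7536232),
  stringsAsFactors = FALSE
)

# Redundant if any more general rule is at least as confident.
is_redundant <- any(general$confidence >= rule$confidence)
is_redundant
```

Adding sex=M to {Dalc=1} does not raise the confidence, so the longer rule carries no extra information. To keep only the non-redundant rules, the selection can be negated: `rules_eclat[!is.redundant(rules_eclat)]`.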

Summary

The association rule techniques allowed me to find relationships between the values of the variables in the dataset. However, no significant patterns regarding students’ alcohol consumption were detected.