I used market basket analysis on student performance data from two schools in Portugal to find out which factors influence whether students pass the final exam. In other words, the pass/fail outcome (final) is the right-hand side (rhs) of every rule, and the remaining student profile attributes form the left-hand side (lhs).
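The raw UCI file (student-por.csv, 649 rows) has no pass/fail column; it records a numeric final grade G3 on a 0-20 scale, so the binary final label used below was presumably derived from it. A minimal sketch of such a step, assuming the usual pass mark of 10 (the file name, the threshold, and the write-out are my assumptions, not taken from the original code):
# hypothetical preprocessing, not part of the original analysis
raw <- read.csv("student-por.csv", sep = ";")         # the UCI file is ';'-separated
raw$final <- ifelse(raw$G3 >= 10, "success", "fail")  # assumed pass mark of 10/20
raw <- raw[, !(names(raw) %in% c("G1", "G2", "G3"))]  # drop the raw grade columns
write.csv(raw, "df.csv")                              # default row names become the X column removed below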
library(readr)
library(tidyverse)
library(arules)
library(arulesViz)
First, I imported the data from the UCI Machine Learning Repository. I removed the X row-number variable, then prefixed every value with its column name so that the resulting association rules are easier to interpret.
df=read.csv("df.csv")
df=df %>% select(-X)
data=df
# prefix each value with its column name (e.g. 'GP' becomes 'school-GP')
for (i in 1:31) {
x=colnames(df)[i]
df[,i]<-paste0(x,'-',df[,i])
}
head(df)
## school sex age address famsize Pstatus Medu Fedu
## 1 school-GP sex-F age-18 address-U famsize-GT3 Pstatus-A Medu-4 Fedu-4
## 2 school-GP sex-F age-17 address-U famsize-GT3 Pstatus-T Medu-1 Fedu-1
## 3 school-GP sex-F age-15 address-U famsize-LE3 Pstatus-T Medu-1 Fedu-1
## 4 school-GP sex-F age-15 address-U famsize-GT3 Pstatus-T Medu-4 Fedu-2
## 5 school-GP sex-F age-16 address-U famsize-GT3 Pstatus-T Medu-3 Fedu-3
## 6 school-GP sex-M age-16 address-U famsize-LE3 Pstatus-T Medu-4 Fedu-3
## Mjob Fjob reason guardian traveltime
## 1 Mjob-at_home Fjob-teacher reason-course guardian-mother traveltime-2
## 2 Mjob-at_home Fjob-other reason-course guardian-father traveltime-1
## 3 Mjob-at_home Fjob-other reason-other guardian-mother traveltime-1
## 4 Mjob-health Fjob-services reason-home guardian-mother traveltime-1
## 5 Mjob-other Fjob-other reason-home guardian-father traveltime-1
## 6 Mjob-services Fjob-other reason-reputation guardian-mother traveltime-1
## studytime failures schoolsup famsup paid activities
## 1 studytime-2 failures-0 schoolsup-yes famsup-no paid-no activities-no
## 2 studytime-2 failures-0 schoolsup-no famsup-yes paid-no activities-no
## 3 studytime-2 failures-0 schoolsup-yes famsup-no paid-no activities-no
## 4 studytime-3 failures-0 schoolsup-no famsup-yes paid-no activities-yes
## 5 studytime-2 failures-0 schoolsup-no famsup-yes paid-no activities-no
## 6 studytime-2 failures-0 schoolsup-no famsup-yes paid-no activities-yes
## nursery higher internet romantic famrel freetime goout
## 1 nursery-yes higher-yes internet-no romantic-no famrel-4 freetime-3 goout-4
## 2 nursery-no higher-yes internet-yes romantic-no famrel-5 freetime-3 goout-3
## 3 nursery-yes higher-yes internet-yes romantic-no famrel-4 freetime-3 goout-2
## 4 nursery-yes higher-yes internet-yes romantic-yes famrel-3 freetime-2 goout-2
## 5 nursery-yes higher-yes internet-no romantic-no famrel-4 freetime-3 goout-2
## 6 nursery-yes higher-yes internet-yes romantic-no famrel-5 freetime-4 goout-2
## Dalc Walc health absences final
## 1 Dalc-1 Walc-1 health-3 absences-4 final-success
## 2 Dalc-1 Walc-1 health-3 absences-2 final-success
## 3 Dalc-2 Walc-3 health-3 absences-6 final-success
## 4 Dalc-1 Walc-1 health-5 absences-0 final-success
## 5 Dalc-1 Walc-2 health-5 absences-0 final-success
## 6 Dalc-1 Walc-2 health-5 absences-6 final-success
The age and absences variables have many unique values and are hard to interpret as items, so I removed them from the data. The resulting df1 is the final dataset for the market basket analysis: it contains 28 variables that can appear on the left-hand side and one right-hand-side variable, final. The column names are listed below.
df<-df %>% select(-age,-absences)
write.csv(df,"df1.csv",row.names = F)
colnames(df)
## [1] "school" "sex" "address" "famsize" "Pstatus"
## [6] "Medu" "Fedu" "Mjob" "Fjob" "reason"
## [11] "guardian" "traveltime" "studytime" "failures" "schoolsup"
## [16] "famsup" "paid" "activities" "nursery" "higher"
## [21] "internet" "romantic" "famrel" "freetime" "goout"
## [26] "Dalc" "Walc" "health" "final"
There are no missing values.
sapply(df,function(x)sum(is.na(x)))
## school sex address famsize Pstatus Medu Fedu
## 0 0 0 0 0 0 0
## Mjob Fjob reason guardian traveltime studytime failures
## 0 0 0 0 0 0 0
## schoolsup famsup paid activities nursery higher internet
## 0 0 0 0 0 0 0
## romantic famrel freetime goout Dalc Walc health
## 0 0 0 0 0 0 0
## final
## 0
We have 649 observations. Below are summary statistics of the numerically coded variables. Judging by the skewness and kurtosis values, studytime and freetime come closest to a normal shape, but all of these variables are really ordinal categories, so numeric summaries only tell part of the story; the boxplots that follow give a clearer picture.
library(e1071)
library(moments)
tb=data %>% select(Medu,Fedu,traveltime,studytime,failures,famrel,freetime,goout,Dalc,Walc,health)
summary(tb)
## Medu Fedu traveltime studytime
## Min. :0.000 Min. :0.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000
## Median :2.000 Median :2.000 Median :1.000 Median :2.000
## Mean :2.515 Mean :2.307 Mean :1.569 Mean :1.931
## 3rd Qu.:4.000 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :4.000 Max. :4.000 Max. :4.000 Max. :4.000
## failures famrel freetime goout
## Min. :0.0000 Min. :1.000 Min. :1.00 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:4.000 1st Qu.:3.00 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :3.00 Median :3.000
## Mean :0.2219 Mean :3.931 Mean :3.18 Mean :3.185
## 3rd Qu.:0.0000 3rd Qu.:5.000 3rd Qu.:4.00 3rd Qu.:4.000
## Max. :3.0000 Max. :5.000 Max. :5.00 Max. :5.000
## Dalc Walc health
## Min. :1.000 Min. :1.00 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.00 1st Qu.:2.000
## Median :1.000 Median :2.00 Median :4.000
## Mean :1.502 Mean :2.28 Mean :3.536
## 3rd Qu.:2.000 3rd Qu.:3.00 3rd Qu.:5.000
## Max. :5.000 Max. :5.00 Max. :5.000
tibble(
Column = names(tb),
Variance = purrr::map_dbl(tb, var),
SD = purrr::map_dbl(tb, sd),
IQR = purrr::map_dbl(tb,IQR),
SKW = purrr::map_dbl(tb,skewness),
KRT = purrr::map_dbl(tb,kurtosis))
## # A tibble: 11 x 6
## Column Variance SD IQR SKW KRT
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Medu 1.29 1.13 2 -0.0299 1.74
## 2 Fedu 1.21 1.10 2 0.215 1.89
## 3 traveltime 0.560 0.749 1 1.24 4.09
## 4 studytime 0.688 0.830 1 0.698 3.03
## 5 failures 0.352 0.593 0 3.09 12.7
## 6 famrel 0.913 0.956 1 -1.10 4.33
## 7 freetime 1.10 1.05 1 -0.181 2.60
## 8 goout 1.38 1.18 2 -0.00856 2.13
## 9 Dalc 0.855 0.925 1 2.14 7.31
## 10 Walc 1.65 1.28 2 0.634 2.23
## 11 health 2.09 1.45 3 -0.499 1.88
The boxplots below summarize the distributions of these variables.
par(mfrow = c(2,2)) # 2 x 2 grid of panels
boxplot(data$Medu,data$Fedu,names = c('Medu','Fedu'))
boxplot(data$traveltime,data$studytime,data$failures,names=c("traveltime",'studytime','failures'))
boxplot(data$famrel,names=c('famrel'))
boxplot(data$freetime,data$goout,data$Dalc,data$Walc,data$health,names=c("freetime",'goout','Dalc','Walc','health'))
In this section, we do market basket analysis on the manually categorised data. The transaction set has 650 itemsets and 126 items; the extra 650th 'transaction' is the CSV header row, which read.transactions keeps because skip = 0, and its 29 bare column-name items are rare enough to disappear in the filtering step below. The five most frequent items are paid-no, schoolsup-no, higher-yes, Pstatus-T and failures-0.
I then removed rare items, keeping only those with an itemFrequency above 0.05, which leaves 87 items. The bar chart below shows the 15 most frequent items.
# note: skip = 0 keeps the header row as an extra transaction; its column-name
# 'items' are removed by the frequency filter below (skip = 1 would drop it here)
trans=read.transactions("df1.csv",format='basket',sep=",",skip=0)
summary(trans)
## transactions as itemMatrix in sparse format with
## 650 rows (elements/itemsets/transactions) and
## 126 columns (items) and a density of 0.2301587
##
## most frequent items:
## paid-no schoolsup-no higher-yes Pstatus-T failures-0 (Other)
## 610 581 580 569 549 15961
##
## element (itemset/transaction) length distribution:
## sizes
## 29
## 650
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 29 29 29 29 29 29
##
## includes extended item information - examples:
## labels
## 1 activities
## 2 activities-no
## 3 activities-yes
# remove rare items (relative frequency of 0.05 or less)
trans1=trans[,itemFrequency(trans)>0.05]
trans1
## transactions in sparse format with
## 650 transactions (rows) and
## 87 items (columns)
# all item labels in the dataset and their relative frequencies
sort(itemFrequency(trans1, type="relative"),decreasing = T)
## paid-no schoolsup-no higher-yes Pstatus-T
## 0.93846154 0.89384615 0.89230769 0.87538462
## failures-0 nursery-yes internet-yes famsize-GT3
## 0.84461538 0.80153846 0.76615385 0.70307692
## guardian-mother address-U final-success Dalc-1
## 0.70000000 0.69538462 0.69538462 0.69384615
## school-GP romantic-no famsup-yes sex-F
## 0.65076923 0.63076923 0.61230769 0.58923077
## Fjob-other traveltime-1 activities-no famrel-4
## 0.56461538 0.56307692 0.51384615 0.48769231
## activities-yes studytime-2 reason-course sex-M
## 0.48461538 0.46923077 0.43846154 0.40923077
## Mjob-other famsup-no freetime-3 health-5
## 0.39692308 0.38615385 0.38615385 0.38307692
## Walc-1 romantic-yes school-MS traveltime-2
## 0.38000000 0.36769231 0.34769231 0.32769231
## studytime-1 Fedu-2 goout-3 address-R
## 0.32615385 0.32153846 0.31538462 0.30307692
## final-fail famsize-LE3 Medu-2 Fjob-services
## 0.30307692 0.29538462 0.28615385 0.27846154
## famrel-5 freetime-4 Medu-4 Fedu-1
## 0.27692308 0.27384615 0.26923077 0.26769231
## guardian-father internet-no Walc-2 reason-home
## 0.23538462 0.23230769 0.23076923 0.22923077
## goout-2 Medu-1 reason-reputation goout-4
## 0.22307692 0.22000000 0.22000000 0.21692308
## Medu-3 Mjob-services Mjob-at_home Fedu-3
## 0.21384615 0.20923077 0.20769231 0.20153846
## Fedu-4 nursery-no health-3 Dalc-2
## 0.19692308 0.19692308 0.19076923 0.18615385
## Walc-3 goout-5 health-4 freetime-2
## 0.18461538 0.16923077 0.16615385 0.16461538
## famrel-3 studytime-3 health-1 Walc-4
## 0.15538462 0.14923077 0.13846154 0.13384615
## Pstatus-A health-2 Mjob-teacher reason-other
## 0.12307692 0.12000000 0.11076923 0.11076923
## failures-1 higher-no freetime-5 schoolsup-yes
## 0.10769231 0.10615385 0.10461538 0.10461538
## traveltime-3 goout-1 Mjob-health freetime-1
## 0.08307692 0.07384615 0.07384615 0.06923077
## Walc-5 Dalc-3 Fjob-at_home guardian-other
## 0.06923077 0.06615385 0.06461538 0.06307692
## paid-yes Fjob-teacher studytime-4
## 0.06000000 0.05538462 0.05384615
itemFrequencyPlot(trans1,type='relative',topN=15)
To see the profile of students who passed the final exam, we use the apriori algorithm with a minimum support of 0.1 and a minimum confidence of 0.5, restricting the right-hand side to 'final-success'. After mining, I sorted the rules by decreasing lift. Note that final-success occurs in about 69.5% of the transactions, so a rule with confidence 1 reaches the maximum possible lift of 1/0.695 ≈ 1.438; many of the top rules hit exactly this ceiling.
rules<-apriori(data=trans1, parameter=list(supp=0.1,conf = 0.5),
appearance=list(default="lhs", rhs='final-success'), control=list(verbose=F))
rules.clean<-rules
rules.final<-sort(rules.clean, by="lift", decreasing=TRUE)
summary(rules.final)
## set of 116034 rules
##
## rule length distribution (lhs + rhs):sizes
## 1 2 3 4 5 6 7 8 9 10
## 1 64 842 4486 13303 24825 30556 24688 12990 4279
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 6.000 7.000 6.956 8.000 10.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.1000 Min. :0.5000 Min. :0.1000 Min. :0.719
## 1st Qu.:0.1077 1st Qu.:0.8068 1st Qu.:0.1277 1st Qu.:1.160
## Median :0.1215 Median :0.8523 Median :0.1462 Median :1.226
## Mean :0.1360 Mean :0.8440 Mean :0.1625 Mean :1.214
## 3rd Qu.:0.1477 3rd Qu.:0.8902 3rd Qu.:0.1769 3rd Qu.:1.280
## Max. :0.6954 Max. :1.0000 Max. :1.0000 Max. :1.438
## count
## Min. : 65.00
## 1st Qu.: 70.00
## Median : 79.00
## Mean : 88.43
## 3rd Qu.: 96.00
## Max. :452.00
##
## mining info:
## data ntransactions support confidence
## trans1 650 0.1 0.5
## call
## apriori(data = trans1, parameter = list(supp = 0.1, conf = 0.5), appearance = list(default = "lhs", rhs = "final-success"), control = list(verbose = F))
rules.final<-sort(rules.final, by="lift", decreasing=TRUE)
inspect(head(rules.final,15))
## lhs rhs support confidence coverage lift count
## [1] {address-U,
## failures-0,
## romantic-no,
## school-GP,
## schoolsup-no,
## sex-F} => {final-success} 0.1323077 1 0.1323077 1.438053 86
## [2] {activities-no,
## failures-0,
## famsup-yes,
## higher-yes,
## school-GP,
## schoolsup-no,
## sex-F} => {final-success} 0.1046154 1 0.1046154 1.438053 68
## [3] {address-U,
## failures-0,
## romantic-no,
## school-GP,
## schoolsup-no,
## sex-F,
## traveltime-1} => {final-success} 0.1015385 1 0.1015385 1.438053 66
## [4] {address-U,
## higher-yes,
## romantic-no,
## school-GP,
## schoolsup-no,
## sex-F,
## traveltime-1} => {final-success} 0.1046154 1 0.1046154 1.438053 68
## [5] {address-U,
## failures-0,
## famsup-yes,
## romantic-no,
## school-GP,
## schoolsup-no,
## sex-F} => {final-success} 0.1000000 1 0.1000000 1.438053 65
## [6] {address-U,
## Dalc-1,
## failures-0,
## romantic-no,
## school-GP,
## schoolsup-no,
## sex-F} => {final-success} 0.1092308 1 0.1092308 1.438053 71
## [7] {address-U,
## failures-0,
## guardian-mother,
## romantic-no,
## school-GP,
## schoolsup-no,
## sex-F} => {final-success} 0.1015385 1 0.1015385 1.438053 66
## [8] {address-U,
## famsize-GT3,
## higher-yes,
## romantic-no,
## school-GP,
## schoolsup-no,
## sex-F} => {final-success} 0.1015385 1 0.1015385 1.438053 66
## [9] {address-U,
## failures-0,
## internet-yes,
## romantic-no,
## school-GP,
## schoolsup-no,
## sex-F} => {final-success} 0.1123077 1 0.1123077 1.438053 73
## [10] {address-U,
## failures-0,
## nursery-yes,
## romantic-no,
## school-GP,
## schoolsup-no,
## sex-F} => {final-success} 0.1061538 1 0.1061538 1.438053 69
## [11] {address-U,
## failures-0,
## Pstatus-T,
## romantic-no,
## school-GP,
## schoolsup-no,
## sex-F} => {final-success} 0.1200000 1 0.1200000 1.438053 78
## [12] {address-U,
## failures-0,
## higher-yes,
## romantic-no,
## school-GP,
## schoolsup-no,
## sex-F} => {final-success} 0.1307692 1 0.1307692 1.438053 85
## [13] {address-U,
## failures-0,
## paid-no,
## romantic-no,
## school-GP,
## schoolsup-no,
## sex-F} => {final-success} 0.1292308 1 0.1292308 1.438053 84
## [14] {address-U,
## higher-yes,
## Pstatus-T,
## romantic-no,
## school-GP,
## schoolsup-no,
## sex-F} => {final-success} 0.1230769 1 0.1230769 1.438053 80
## [15] {activities-no,
## failures-0,
## famsup-yes,
## higher-yes,
## paid-no,
## school-GP,
## schoolsup-no,
## sex-F} => {final-success} 0.1000000 1 0.1000000 1.438053 65
plot(head(rules.final,10), method="graph", engine="htmlwidget")
aff.items<-dissimilarity(trans1, which="items", method="affinity")
hc<-hclust(aff.items, method="ward.D2")
plot(hc, main="Dendrogram for Items")
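With 116,034 rules at these thresholds, many are redundant refinements of shorter rules that do not add confidence. An optional pruning step, sketched here with the is.redundant() helper from arules (this step is not part of the original analysis):
# keep only rules for which no simpler sub-rule has equal or higher confidence
nonredundant <- rules.final[!is.redundant(rules.final)]
length(nonredundant)
inspect(head(nonredundant, 5))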
To see the profile of students who failed the final exam, we run apriori again with a minimum support of 0.1 and a minimum confidence of 0.5, this time restricting the right-hand side to 'final-fail'. The rules are again sorted by decreasing lift; since final-fail occurs in only about 30.3% of the transactions, a confidence of 0.58 already corresponds to a lift of about 1.9.
rules<-apriori(data=trans1, parameter=list(supp=0.1,conf = 0.5),
appearance=list(default="lhs", rhs='final-fail'), control=list(verbose=F))
rules.clean<-rules
rules.final<-sort(rules.clean, by="lift", decreasing=TRUE)
summary(rules.final)
## set of 32 rules
##
## rule length distribution (lhs + rhs):sizes
## 3 4 5 6
## 6 14 9 3
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 4.000 4.000 4.281 5.000 6.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.1000 Min. :0.5000 Min. :0.1754 Min. :1.650
## 1st Qu.:0.1058 1st Qu.:0.5034 1st Qu.:0.1950 1st Qu.:1.661
## Median :0.1154 Median :0.5294 Median :0.2208 Median :1.747
## Mean :0.1172 Mean :0.5284 Mean :0.2229 Mean :1.743
## 3rd Qu.:0.1235 3rd Qu.:0.5456 3rd Qu.:0.2365 3rd Qu.:1.800
## Max. :0.1646 Max. :0.5789 Max. :0.3292 Max. :1.910
## count
## Min. : 65.00
## 1st Qu.: 68.75
## Median : 75.00
## Mean : 76.19
## 3rd Qu.: 80.25
## Max. :107.00
##
## mining info:
## data ntransactions support confidence
## trans1 650 0.1 0.5
## call
## apriori(data = trans1, parameter = list(supp = 0.1, conf = 0.5), appearance = list(default = "lhs", rhs = "final-fail"), control = list(verbose = F))
rules.final<-sort(rules.final, by="lift", decreasing=TRUE)
inspect(head(rules.final,10))
## lhs rhs support confidence coverage lift count
## [1] {reason-course,
## school-MS,
## schoolsup-no} => {final-fail} 0.1015385 0.5789474 0.1753846 1.910232 66
## [2] {reason-course,
## school-MS} => {final-fail} 0.1030769 0.5677966 0.1815385 1.873441 67
## [3] {guardian-mother,
## paid-no,
## Pstatus-T,
## school-MS} => {final-fail} 0.1046154 0.5528455 0.1892308 1.824110 68
## [4] {guardian-mother,
## Pstatus-T,
## school-MS} => {final-fail} 0.1107692 0.5496183 0.2015385 1.813461 72
## [5] {famsize-GT3,
## paid-no,
## school-MS,
## schoolsup-no} => {final-fail} 0.1138462 0.5481481 0.2076923 1.808611 74
## [6] {famsize-GT3,
## school-MS,
## schoolsup-no} => {final-fail} 0.1230769 0.5479452 0.2246154 1.807941 80
## [7] {guardian-mother,
## Pstatus-T,
## school-MS,
## schoolsup-no} => {final-fail} 0.1061538 0.5476190 0.1938462 1.806865 69
## [8] {guardian-mother,
## paid-no,
## Pstatus-T,
## school-MS,
## schoolsup-no} => {final-fail} 0.1000000 0.5462185 0.1830769 1.802244 65
## [9] {famsize-GT3,
## nursery-yes,
## school-MS} => {final-fail} 0.1015385 0.5454545 0.1861538 1.799723 66
## [10] {famsup-yes,
## school-MS,
## schoolsup-no} => {final-fail} 0.1000000 0.5416667 0.1846154 1.787225 65
plot(head(rules.final,10), method="graph", engine="htmlwidget")
aff.items<-dissimilarity(trans1, which="items", method="affinity")
hc<-hclust(aff.items, method="ward.D2")
plot(hc, main="Dendrogram for Items")
To do market basket analysis on automatically categorised data, we first convert every variable to a factor.
We then keep the 452 students who passed (final-success) and summarise them. As before, only items with an itemFrequency above 0.05 are kept, and class association rules are mined with a minimum support of 0.1 and a minimum confidence of 0.5. Because every transaction in this subset already contains final-success, all mined rules have confidence and lift equal to 1, so only support distinguishes them.
library(arulesCBA)
trans2=df
# convert every column to a factor so it is treated as categorical downstream
for (i in 1:29) {
trans2[,i]<-factor(trans2[,i],levels = unique(trans2[,i]))
}
trans21=trans2 %>% filter(final=='final-success')
data.disc=discretizeDF.supervised(data=trans21,final~.,methods = 'chi2')
summary(data.disc)
## school sex address famsize Pstatus
## school-GP:338 sex-F:281 address-U:335 famsize-GT3:317 Pstatus-A: 56
## school-MS:114 sex-M:171 address-R:117 famsize-LE3:135 Pstatus-T:396
##
##
##
## Medu Fedu Mjob Fjob
## Medu-4:151 Fedu-4:107 Mjob-at_home : 78 Fjob-teacher : 33
## Medu-1: 75 Fedu-1: 95 Mjob-health : 36 Fjob-other :261
## Medu-3: 94 Fedu-2:142 Mjob-other :176 Fjob-services:117
## Medu-2:127 Fedu-3:103 Mjob-services: 98 Fjob-health : 16
## Medu-0: 5 Fedu-0: 5 Mjob-teacher : 64 Fjob-at_home : 25
## reason guardian traveltime
## reason-course :176 guardian-mother:310 traveltime-2:137
## reason-other : 44 guardian-father:118 traveltime-1:273
## reason-home :116 guardian-other : 24 traveltime-3: 33
## reason-reputation:116 traveltime-4: 9
##
## studytime failures schoolsup famsup
## studytime-2:221 failures-0:433 schoolsup-yes: 45 famsup-no :169
## studytime-3: 82 failures-3: 1 schoolsup-no :407 famsup-yes:283
## studytime-1:121 failures-1: 15
## studytime-4: 28 failures-2: 3
##
## paid activities nursery higher
## paid-no :425 activities-no :224 nursery-yes:365 higher-yes:435
## paid-yes: 27 activities-yes:228 nursery-no : 87 higher-no : 17
##
##
##
## internet romantic famrel freetime
## internet-no : 90 romantic-no :300 famrel-4:234 freetime-3:184
## internet-yes:362 romantic-yes:152 famrel-5:120 freetime-2: 87
## famrel-3: 68 freetime-4:115
## famrel-1: 11 freetime-1: 29
## famrel-2: 19 freetime-5: 37
## goout Dalc Walc health final
## goout-4: 99 Dalc-1:334 Walc-1:183 health-3: 88 final-success:452
## goout-3:150 Dalc-2: 77 Walc-3: 83 health-5:165 final-fail : 0
## goout-2:110 Dalc-5: 8 Walc-2:110 health-1: 65
## goout-1: 29 Dalc-3: 22 Walc-4: 51 health-2: 55
## goout-5: 64 Dalc-4: 11 Walc-5: 25 health-4: 79
data.trans<-transactions(data.disc)
trans2<-data.trans[, itemFrequency(data.trans)>0.05]
data.ass<-mineCARs(final~ ., transactions=trans2, support=0.1, confidence=0.5)
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE FALSE 5 0.1 1
## maxlen target ext
## 5 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 45
##
## set item appearances ...[83 item(s)] done [0.00s].
## set transactions ...[83 item(s), 452 transaction(s)] done [0.00s].
## sorting and recoding items ... [83 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.22s].
## writing ... [36813 rule(s)] done [0.02s].
## creating S4 object ... done [0.02s].
data.ass=sort(data.ass,by='support',decreasing = TRUE)
summary(data.ass)
## set of 36813 rules
##
## rule length distribution (lhs + rhs):sizes
## 1 2 3 4 5
## 1 69 1171 7845 27727
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 5.000 5.000 4.718 5.000 5.000
##
## summary of quality measures:
## support confidence coverage lift count
## Min. :0.1018 Min. :1 Min. :0.1018 Min. :1 Min. : 46.00
## 1st Qu.:0.1173 1st Qu.:1 1st Qu.:0.1173 1st Qu.:1 1st Qu.: 53.00
## Median :0.1438 Median :1 Median :0.1438 Median :1 Median : 65.00
## Mean :0.1756 Mean :1 Mean :0.1756 Mean :1 Mean : 79.35
## 3rd Qu.:0.1969 3rd Qu.:1 3rd Qu.:0.1969 3rd Qu.:1 3rd Qu.: 89.00
## Max. :1.0000 Max. :1 Max. :1.0000 Max. :1 Max. :452.00
##
## mining info:
## data ntransactions support confidence
## transactions 452 0.1 0.5
## call
## apriori(data = transactions, parameter = parameter, appearance = list(rhs = vars$class_items, lhs = vars$feature_items), control = control)
inspect(head(data.ass,20))
## lhs rhs support confidence coverage lift count
## [1] {} => {final=final-success} 1.0000000 1 1.0000000 1 452
## [2] {higher=higher-yes} => {final=final-success} 0.9623894 1 0.9623894 1 435
## [3] {failures=failures-0} => {final=final-success} 0.9579646 1 0.9579646 1 433
## [4] {paid=paid-no} => {final=final-success} 0.9402655 1 0.9402655 1 425
## [5] {failures=failures-0,
## higher=higher-yes} => {final=final-success} 0.9269912 1 0.9269912 1 419
## [6] {failures=failures-0,
## paid=paid-no} => {final=final-success} 0.9048673 1 0.9048673 1 409
## [7] {paid=paid-no,
## higher=higher-yes} => {final=final-success} 0.9048673 1 0.9048673 1 409
## [8] {schoolsup=schoolsup-no} => {final=final-success} 0.9004425 1 0.9004425 1 407
## [9] {Pstatus=Pstatus-T} => {final=final-success} 0.8761062 1 0.8761062 1 396
## [10] {failures=failures-0,
## paid=paid-no,
## higher=higher-yes} => {final=final-success} 0.8761062 1 0.8761062 1 396
## [11] {failures=failures-0,
## schoolsup=schoolsup-no} => {final=final-success} 0.8650442 1 0.8650442 1 391
## [12] {schoolsup=schoolsup-no,
## higher=higher-yes} => {final=final-success} 0.8628319 1 0.8628319 1 390
## [13] {schoolsup=schoolsup-no,
## paid=paid-no} => {final=final-success} 0.8517699 1 0.8517699 1 385
## [14] {Pstatus=Pstatus-T,
## higher=higher-yes} => {final=final-success} 0.8429204 1 0.8429204 1 381
## [15] {Pstatus=Pstatus-T,
## failures=failures-0} => {final=final-success} 0.8407080 1 0.8407080 1 380
## [16] {failures=failures-0,
## schoolsup=schoolsup-no,
## higher=higher-yes} => {final=final-success} 0.8340708 1 0.8340708 1 377
## [17] {Pstatus=Pstatus-T,
## paid=paid-no} => {final=final-success} 0.8230088 1 0.8230088 1 372
## [18] {failures=failures-0,
## schoolsup=schoolsup-no,
## paid=paid-no} => {final=final-success} 0.8230088 1 0.8230088 1 372
## [19] {schoolsup=schoolsup-no,
## paid=paid-no,
## higher=higher-yes} => {final=final-success} 0.8163717 1 0.8163717 1 369
## [20] {Pstatus=Pstatus-T,
## failures=failures-0,
## higher=higher-yes} => {final=final-success} 0.8119469 1 0.8119469 1 367
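Since every rule in this subset has confidence and lift of 1, the trivial empty-antecedent rule {} => {final-success} tops the list and longer antecedents mostly restate item co-occurrence. A small sketch (an optional step, not in the original code) that drops the trivial rule and keeps antecedents of at most two items before ranking by support:
# drop the empty-lhs rule and keep short antecedents, then rank by support
keep <- size(lhs(data.ass)) > 0 & size(lhs(data.ass)) <= 2
short.rules <- sort(data.ass[keep], by = "support")
inspect(head(short.rules, 10))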
We repeat the same procedure for the 197 students who failed (final-fail). Again, only items with an itemFrequency above 0.05 are kept, and class association rules are mined with a minimum support of 0.1 and a minimum confidence of 0.5; as in the previous subset, confidence and lift are always 1, so the rules are effectively ranked by support.
library(arulesCBA)
trans2=df
for (i in 1:29) {
trans2[,i]<-factor(trans2[,i],levels = unique(trans2[,i]))
}
trans22=trans2 %>% filter(final=='final-fail')
data.disc=discretizeDF.supervised(data=trans22,final~.,methods = 'chi2')
summary(data.disc)
## school sex address famsize Pstatus
## school-GP: 85 sex-F:102 address-U:117 famsize-GT3:140 Pstatus-A: 24
## school-MS:112 sex-M: 95 address-R: 80 famsize-LE3: 57 Pstatus-T:173
##
##
##
## Medu Fedu Mjob Fjob
## Medu-4:24 Fedu-4:21 Mjob-at_home :57 Fjob-teacher : 3
## Medu-1:68 Fedu-1:79 Mjob-health :12 Fjob-other :106
## Medu-3:45 Fedu-2:67 Mjob-other :82 Fjob-services: 64
## Medu-2:59 Fedu-3:28 Mjob-services:38 Fjob-health : 7
## Medu-0: 1 Fedu-0: 2 Mjob-teacher : 8 Fjob-at_home : 17
## reason guardian traveltime
## reason-course :109 guardian-mother:145 traveltime-2:76
## reason-other : 28 guardian-father: 35 traveltime-1:93
## reason-home : 33 guardian-other : 17 traveltime-3:21
## reason-reputation: 27 traveltime-4: 7
##
## studytime failures schoolsup famsup
## studytime-2:84 failures-0:116 schoolsup-yes: 23 famsup-no : 82
## studytime-3:15 failures-3: 13 schoolsup-no :174 famsup-yes:115
## studytime-1:91 failures-1: 55
## studytime-4: 7 failures-2: 13
##
## paid activities nursery higher
## paid-no :185 activities-no :110 nursery-yes:156 higher-yes:145
## paid-yes: 12 activities-yes: 87 nursery-no : 41 higher-no : 52
##
##
##
## internet romantic famrel freetime
## internet-no : 61 romantic-no :110 famrel-4:83 freetime-3:67
## internet-yes:136 romantic-yes: 87 famrel-5:60 freetime-2:20
## famrel-3:33 freetime-4:63
## famrel-1:11 freetime-1:16
## famrel-2:10 freetime-5:31
## goout Dalc Walc health final
## goout-4:42 Dalc-1:117 Walc-1:64 health-3:36 final-success: 0
## goout-3:55 Dalc-2: 44 Walc-3:37 health-5:84 final-fail :197
## goout-2:35 Dalc-5: 9 Walc-2:40 health-1:25
## goout-1:19 Dalc-3: 21 Walc-4:36 health-2:23
## goout-5:46 Dalc-4: 6 Walc-5:20 health-4:29
data.trans<-transactions(data.disc)
trans2<-data.trans[, itemFrequency(data.trans)>0.05]
data.ass<-mineCARs(final~ ., transactions=trans2, support=0.1, confidence=0.5)
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE FALSE 5 0.1 1
## maxlen target ext
## 5 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 19
##
## set item appearances ...[87 item(s)] done [0.00s].
## set transactions ...[87 item(s), 197 transaction(s)] done [0.00s].
## sorting and recoding items ... [87 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.10s].
## writing ... [30188 rule(s)] done [0.00s].
## creating S4 object ... done [0.03s].
data.ass=sort(data.ass,by='support',decreasing = TRUE)
summary(data.ass)
## set of 30188 rules
##
## rule length distribution (lhs + rhs):sizes
## 1 2 3 4 5
## 1 75 1274 7822 21016
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 4.000 5.000 4.649 5.000 5.000
##
## summary of quality measures:
## support confidence coverage lift count
## Min. :0.1015 Min. :1 Min. :0.1015 Min. :1 Min. : 20.00
## 1st Qu.:0.1117 1st Qu.:1 1st Qu.:0.1117 1st Qu.:1 1st Qu.: 22.00
## Median :0.1320 Median :1 Median :0.1320 Median :1 Median : 26.00
## Mean :0.1529 Mean :1 Mean :0.1529 Mean :1 Mean : 30.12
## 3rd Qu.:0.1726 3rd Qu.:1 3rd Qu.:0.1726 3rd Qu.:1 3rd Qu.: 34.00
## Max. :1.0000 Max. :1 Max. :1.0000 Max. :1 Max. :197.00
##
## mining info:
## data ntransactions support confidence
## transactions 197 0.1 0.5
## call
## apriori(data = transactions, parameter = parameter, appearance = list(rhs = vars$class_items, lhs = vars$feature_items), control = control)
inspect(head(data.ass,20))
## lhs rhs support confidence coverage lift count
## [1] {} => {final=final-fail} 1.0000000 1 1.0000000 1 197
## [2] {paid=paid-no} => {final=final-fail} 0.9390863 1 0.9390863 1 185
## [3] {schoolsup=schoolsup-no} => {final=final-fail} 0.8832487 1 0.8832487 1 174
## [4] {Pstatus=Pstatus-T} => {final=final-fail} 0.8781726 1 0.8781726 1 173
## [5] {schoolsup=schoolsup-no,
## paid=paid-no} => {final=final-fail} 0.8274112 1 0.8274112 1 163
## [6] {Pstatus=Pstatus-T,
## paid=paid-no} => {final=final-fail} 0.8223350 1 0.8223350 1 162
## [7] {nursery=nursery-yes} => {final=final-fail} 0.7918782 1 0.7918782 1 156
## [8] {Pstatus=Pstatus-T,
## schoolsup=schoolsup-no} => {final=final-fail} 0.7664975 1 0.7664975 1 151
## [9] {paid=paid-no,
## nursery=nursery-yes} => {final=final-fail} 0.7411168 1 0.7411168 1 146
## [10] {higher=higher-yes} => {final=final-fail} 0.7360406 1 0.7360406 1 145
## [11] {guardian=guardian-mother} => {final=final-fail} 0.7360406 1 0.7360406 1 145
## [12] {Pstatus=Pstatus-T,
## schoolsup=schoolsup-no,
## paid=paid-no} => {final=final-fail} 0.7157360 1 0.7157360 1 141
## [13] {famsize=famsize-GT3} => {final=final-fail} 0.7106599 1 0.7106599 1 140
## [14] {Pstatus=Pstatus-T,
## nursery=nursery-yes} => {final=final-fail} 0.7055838 1 0.7055838 1 139
## [15] {schoolsup=schoolsup-no,
## nursery=nursery-yes} => {final=final-fail} 0.6954315 1 0.6954315 1 137
## [16] {internet=internet-yes} => {final=final-fail} 0.6903553 1 0.6903553 1 136
## [17] {guardian=guardian-mother,
## paid=paid-no} => {final=final-fail} 0.6903553 1 0.6903553 1 136
## [18] {paid=paid-no,
## higher=higher-yes} => {final=final-fail} 0.6852792 1 0.6852792 1 135
## [19] {famsize=famsize-GT3,
## paid=paid-no} => {final=final-fail} 0.6598985 1 0.6598985 1 130
## [20] {Pstatus=Pstatus-T,
## guardian=guardian-mother} => {final=final-fail} 0.6598985 1 0.6598985 1 130
According to the highest-lift rules above, students who attend the GP school, live in an urban area, are female, have no past class failures, receive no extra school support, and are not in a romantic relationship were the most likely to pass the final exam.
However, students who attend the MS school, chose the school for its courses, have their mother as guardian, have parents living together, and come from larger families (famsize greater than 3) were the most likely to fail the final exam.
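As a quick check of the failure profile (a sketch, not part of the original code), the fail rules mined earlier on the manually categorised data can be narrowed to those whose antecedent contains both school-MS and reason-course:
# rules.final still holds the 'final-fail' rules from the earlier apriori run;
# %ain% keeps rules whose lhs contains all of the listed items
ms.course <- subset(rules.final, subset = lhs %ain% c("school-MS", "reason-course"))
inspect(head(sort(ms.course, by = "lift"), 3))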