I used market basket analysis on student performance data from two schools in Portugal to find out which factors influence whether students pass the final exam. In other words, the pass/fail outcome (final) is the right-hand side (rhs) of every rule, and the remaining student profile attributes form the left-hand side (lhs).
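The raw UCI file (student-por.csv, 649 rows) has no pass/fail column; it records a numeric final grade G3 on a 0-20 scale, so the binary final label used below was presumably derived from it. A minimal sketch of such a step, assuming the usual pass mark of 10 (the file name, the threshold, and the write-out are my assumptions, not taken from the original code):
# hypothetical preprocessing, not part of the original analysis
raw <- read.csv("student-por.csv", sep = ";")         # the UCI file is ';'-separated
raw$final <- ifelse(raw$G3 >= 10, "success", "fail")  # assumed pass mark of 10/20
raw <- raw[, !(names(raw) %in% c("G1", "G2", "G3"))]  # drop the raw grade columns
write.csv(raw, "df.csv")                              # default row names become the X column removed below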
library(readr)
library(tidyverse)
library(arules)
library(arulesViz)
First, I imported the data from the UCI Machine Learning Repository. I removed the X row-number variable, then prefixed every value with its column name so that the resulting association rules are easier to interpret.
df=read.csv("df.csv")
df=df %>% select(-X)
data=df
# prefix each value with its column name (e.g. 'GP' becomes 'school-GP')
for (i in 1:31) {
x=colnames(df)[i]
df[,i]<-paste0(x,'-',df[,i])
}
head(df)
## school sex age address famsize Pstatus Medu Fedu
## 1 school-GP sex-F age-18 address-U famsize-GT3 Pstatus-A Medu-4 Fedu-4
## 2 school-GP sex-F age-17 address-U famsize-GT3 Pstatus-T Medu-1 Fedu-1
## 3 school-GP sex-F age-15 address-U famsize-LE3 Pstatus-T Medu-1 Fedu-1
## 4 school-GP sex-F age-15 address-U famsize-GT3 Pstatus-T Medu-4 Fedu-2
## 5 school-GP sex-F age-16 address-U famsize-GT3 Pstatus-T Medu-3 Fedu-3
## 6 school-GP sex-M age-16 address-U famsize-LE3 Pstatus-T Medu-4 Fedu-3
## Mjob Fjob reason guardian traveltime
## 1 Mjob-at_home Fjob-teacher reason-course guardian-mother traveltime-2
## 2 Mjob-at_home Fjob-other reason-course guardian-father traveltime-1
## 3 Mjob-at_home Fjob-other reason-other guardian-mother traveltime-1
## 4 Mjob-health Fjob-services reason-home guardian-mother traveltime-1
## 5 Mjob-other Fjob-other reason-home guardian-father traveltime-1
## 6 Mjob-services Fjob-other reason-reputation guardian-mother traveltime-1
## studytime failures schoolsup famsup paid activities
## 1 studytime-2 failures-0 schoolsup-yes famsup-no paid-no activities-no
## 2 studytime-2 failures-0 schoolsup-no famsup-yes paid-no activities-no
## 3 studytime-2 failures-0 schoolsup-yes famsup-no paid-no activities-no
## 4 studytime-3 failures-0 schoolsup-no famsup-yes paid-no activities-yes
## 5 studytime-2 failures-0 schoolsup-no famsup-yes paid-no activities-no
## 6 studytime-2 failures-0 schoolsup-no famsup-yes paid-no activities-yes
## nursery higher internet romantic famrel freetime goout
## 1 nursery-yes higher-yes internet-no romantic-no famrel-4 freetime-3 goout-4
## 2 nursery-no higher-yes internet-yes romantic-no famrel-5 freetime-3 goout-3
## 3 nursery-yes higher-yes internet-yes romantic-no famrel-4 freetime-3 goout-2
## 4 nursery-yes higher-yes internet-yes romantic-yes famrel-3 freetime-2 goout-2
## 5 nursery-yes higher-yes internet-no romantic-no famrel-4 freetime-3 goout-2
## 6 nursery-yes higher-yes internet-yes romantic-no famrel-5 freetime-4 goout-2
## Dalc Walc health absences final
## 1 Dalc-1 Walc-1 health-3 absences-4 final-success
## 2 Dalc-1 Walc-1 health-3 absences-2 final-success
## 3 Dalc-2 Walc-3 health-3 absences-6 final-success
## 4 Dalc-1 Walc-1 health-5 absences-0 final-success
## 5 Dalc-1 Walc-2 health-5 absences-0 final-success
## 6 Dalc-1 Walc-2 health-5 absences-6 final-success
The age and absences variables have many unique values and are hard to interpret as items, so I removed them from the data. The resulting df1 is the final dataset for the market basket analysis: it contains 28 variables that can appear on the left-hand side and one right-hand-side variable, final. The column names are listed below.
df<-df %>% select(-age,-absences)
write.csv(df,"df1.csv",row.names = F)
colnames(df)
## [1] "school" "sex" "address" "famsize" "Pstatus"
## [6] "Medu" "Fedu" "Mjob" "Fjob" "reason"
## [11] "guardian" "traveltime" "studytime" "failures" "schoolsup"
## [16] "famsup" "paid" "activities" "nursery" "higher"
## [21] "internet" "romantic" "famrel" "freetime" "goout"
## [26] "Dalc" "Walc" "health" "final"
There are no missing values.
sapply(df,function(x)sum(is.na(x)))
## school sex address famsize Pstatus Medu Fedu
## 0 0 0 0 0 0 0
## Mjob Fjob reason guardian traveltime studytime failures
## 0 0 0 0 0 0 0
## schoolsup famsup paid activities nursery higher internet
## 0 0 0 0 0 0 0
## romantic famrel freetime goout Dalc Walc health
## 0 0 0 0 0 0 0
## final
## 0
We have 649 observations. Below are summary statistics of the numerically coded variables. Judging by the skewness and kurtosis values, studytime and freetime come closest to a normal shape, but all of these variables are really ordinal categories, so numeric summaries only tell part of the story; the boxplots that follow give a clearer picture.
library(e1071)
library(moments)
tb=data %>% select(Medu,Fedu,traveltime,studytime,failures,famrel,freetime,goout,Dalc,Walc,health)
summary(tb)
## Medu Fedu traveltime studytime
## Min. :0.000 Min. :0.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000
## Median :2.000 Median :2.000 Median :1.000 Median :2.000
## Mean :2.515 Mean :2.307 Mean :1.569 Mean :1.931
## 3rd Qu.:4.000 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :4.000 Max. :4.000 Max. :4.000 Max. :4.000
## failures famrel freetime goout
## Min. :0.0000 Min. :1.000 Min. :1.00 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:4.000 1st Qu.:3.00 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :3.00 Median :3.000
## Mean :0.2219 Mean :3.931 Mean :3.18 Mean :3.185
## 3rd Qu.:0.0000 3rd Qu.:5.000 3rd Qu.:4.00 3rd Qu.:4.000
## Max. :3.0000 Max. :5.000 Max. :5.00 Max. :5.000
## Dalc Walc health
## Min. :1.000 Min. :1.00 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.00 1st Qu.:2.000
## Median :1.000 Median :2.00 Median :4.000
## Mean :1.502 Mean :2.28 Mean :3.536
## 3rd Qu.:2.000 3rd Qu.:3.00 3rd Qu.:5.000
## Max. :5.000 Max. :5.00 Max. :5.000
tibble(
Column = names(tb),
Variance = purrr::map_dbl(tb, var),
SD = purrr::map_dbl(tb, sd),
IQR = purrr::map_dbl(tb,IQR),
SKW = purrr::map_dbl(tb,skewness),
KRT = purrr::map_dbl(tb,kurtosis))
## # A tibble: 11 x 6
## Column Variance SD IQR SKW KRT
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Medu 1.29 1.13 2 -0.0299 1.74
## 2 Fedu 1.21 1.10 2 0.215 1.89
## 3 traveltime 0.560 0.749 1 1.24 4.09
## 4 studytime 0.688 0.830 1 0.698 3.03
## 5 failures 0.352 0.593 0 3.09 12.7
## 6 famrel 0.913 0.956 1 -1.10 4.33
## 7 freetime 1.10 1.05 1 -0.181 2.60
## 8 goout 1.38 1.18 2 -0.00856 2.13
## 9 Dalc 0.855 0.925 1 2.14 7.31
## 10 Walc 1.65 1.28 2 0.634 2.23
## 11 health 2.09 1.45 3 -0.499 1.88
The boxplots below summarize the distributions of these variables.
par(mfrow = c(2,2)) # 2 x 2 grid of panels
boxplot(data$Medu,data$Fedu,names = c('Medu','Fedu'))
boxplot(data$traveltime,data$studytime,data$failures,names=c("traveltime",'studytime','failures'))
boxplot(data$famrel,names=c('famrel'))
boxplot(data$freetime,data$goout,data$Dalc,data$Walc,data$health,names=c("freetime",'goout','Dalc','Walc','health'))
In this section, we do market basket analysis on the manually categorised data. The transaction set has 650 itemsets and 126 items; the extra 650th 'transaction' is the CSV header row, which read.transactions keeps because skip = 0, and its 29 bare column-name items are rare enough to disappear in the filtering step below. The five most frequent items are paid-no, schoolsup-no, higher-yes, Pstatus-T and failures-0.
I then removed rare items, keeping only those with an itemFrequency above 0.05, which leaves 87 items. The bar chart below shows the 15 most frequent items.
# note: skip = 0 keeps the header row as an extra transaction; its column-name
# 'items' are removed by the frequency filter below (skip = 1 would drop it here)
trans=read.transactions("df1.csv",format='basket',sep=",",skip=0)
summary(trans)
## transactions as itemMatrix in sparse format with
## 650 rows (elements/itemsets/transactions) and
## 126 columns (items) and a density of 0.2301587
##
## most frequent items:
## paid-no schoolsup-no higher-yes Pstatus-T failures-0 (Other)
## 610 581 580 569 549 15961
##
## element (itemset/transaction) length distribution:
## sizes
## 29
## 650
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 29 29 29 29 29 29
##
## includes extended item information - examples:
## labels
## 1 activities
## 2 activities-no
## 3 activities-yes
# remove rare items (relative frequency of 0.05 or less)
trans1=trans[,itemFrequency(trans)>0.05]
trans1
## transactions in sparse format with
## 650 transactions (rows) and
## 87 items (columns)
# all item labels in the dataset and their relative frequencies
sort(itemFrequency(trans1, type="relative"),decreasing = T)
## paid-no schoolsup-no higher-yes Pstatus-T
## 0.93846154 0.89384615 0.89230769 0.87538462
## failures-0 nursery-yes internet-yes famsize-GT3
## 0.84461538 0.80153846 0.76615385 0.70307692
## guardian-mother address-U final-success Dalc-1
## 0.70000000 0.69538462 0.69538462 0.69384615
## school-GP romantic-no famsup-yes sex-F
## 0.65076923 0.63076923 0.61230769 0.58923077
## Fjob-other traveltime-1 activities-no famrel-4
## 0.56461538 0.56307692 0.51384615 0.48769231
## activities-yes studytime-2 reason-course sex-M
## 0.48461538 0.46923077 0.43846154 0.40923077
## Mjob-other famsup-no freetime-3 health-5
## 0.39692308 0.38615385 0.38615385 0.38307692
## Walc-1 romantic-yes school-MS traveltime-2
## 0.38000000 0.36769231 0.34769231 0.32769231
## studytime-1 Fedu-2 goout-3 address-R
## 0.32615385 0.32153846 0.31538462 0.30307692
## final-fail famsize-LE3 Medu-2 Fjob-services
## 0.30307692 0.29538462 0.28615385 0.27846154
## famrel-5 freetime-4 Medu-4 Fedu-1
## 0.27692308 0.27384615 0.26923077 0.26769231
## guardian-father internet-no Walc-2 reason-home
## 0.23538462 0.23230769 0.23076923 0.22923077
## goout-2 Medu-1 reason-reputation goout-4
## 0.22307692 0.22000000 0.22000000 0.21692308
## Medu-3 Mjob-services Mjob-at_home Fedu-3
## 0.21384615 0.20923077 0.20769231 0.20153846
## Fedu-4 nursery-no health-3 Dalc-2
## 0.19692308 0.19692308 0.19076923 0.18615385
## Walc-3 goout-5 health-4 freetime-2
## 0.18461538 0.16923077 0.16615385 0.16461538
## famrel-3 studytime-3 health-1 Walc-4
## 0.15538462 0.14923077 0.13846154 0.13384615
## Pstatus-A health-2 Mjob-teacher reason-other
## 0.12307692 0.12000000 0.11076923 0.11076923
## failures-1 higher-no freetime-5 schoolsup-yes
## 0.10769231 0.10615385 0.10461538 0.10461538
## traveltime-3 goout-1 Mjob-health freetime-1
## 0.08307692 0.07384615 0.07384615 0.06923077
## Walc-5 Dalc-3 Fjob-at_home guardian-other
## 0.06923077 0.06615385 0.06461538 0.06307692
## paid-yes Fjob-teacher studytime-4
## 0.06000000 0.05538462 0.05384615
itemFrequencyPlot(trans1,type='relative',topN=15)
To see the profile of students who passed the final exam, we use the apriori algorithm with a minimum support of 0.1 and a minimum confidence of 0.5, restricting the right-hand side to 'final-success'. After mining, I sorted the rules by decreasing lift. Note that final-success occurs in about 69.5% of the transactions, so a rule with confidence 1 reaches the maximum possible lift of 1/0.695 ≈ 1.438; many of the top rules hit exactly this ceiling.
rules<-apriori(data=trans1, parameter=list(supp=0.1,conf = 0.5),
appearance=list(default="lhs", rhs='final-success'), control=list(verbose=F))
rules.clean<-rules
rules.final<-sort(rules.clean, by="lift", decreasing=TRUE)
summary(rules.final)
## set of 116034 rules
##
## rule length distribution (lhs + rhs):sizes
## 1 2 3 4 5 6 7 8 9 10
## 1 64 842 4486 13303 24825 30556 24688 12990 4279
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 6.000 7.000 6.956 8.000 10.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.1000 Min. :0.5000 Min. :0.1000 Min. :0.719
## 1st Qu.:0.1077 1st Qu.:0.8068 1st Qu.:0.1277 1st Qu.:1.160
## Median :0.1215 Median :0.8523 Median :0.1462 Median :1.226
## Mean :0.1360 Mean :0.8440 Mean :0.1625 Mean :1.214
## 3rd Qu.:0.1477 3rd Qu.:0.8902 3rd Qu.:0.1769 3rd Qu.:1.280
## Max. :0.6954 Max. :1.0000 Max. :1.0000 Max. :1.438
## count
## Min. : 65.00
## 1st Qu.: 70.00
## Median : 79.00
## Mean : 88.43
## 3rd Qu.: 96.00
## Max. :452.00
##
## mining info:
## data ntransactions support confidence
## trans1 650 0.1 0.5
## call
## apriori(data = trans1, parameter = list(supp = 0.1, conf = 0.5), appearance = list(default = "lhs", rhs = "final-success"), control = list(verbose = F))
rules.final<-sort(rules.final, by="lift", decreasing=TRUE)
inspect(head(rules.final,15))
## lhs rhs support confidence coverage lift count
## [1] {address-U,
## failures-0,
## romantic-no,
## school-GP,
## schoolsup-no,
## sex-F} => {final-success} 0.1323077 1 0.1323077 1.438053 86
## [2] {activities-no,
## failures-0,
## famsup-yes,
## higher-yes,
## school-GP,
## schoolsup-no,
## sex-F} => {final-success} 0.1046154 1 0.1046154 1.438053 68
## [3] {address-U,
## failures-0,
## romantic-no,
## school-GP,
## schoolsup-no,
## sex-F,
## traveltime-1} => {final-success} 0.1015385 1 0.1015385 1.438053 66
## [4] {address-U,
## higher-yes,
## romantic-no,
## school-GP,
## schoolsup-no,
## sex-F,
## traveltime-1} => {final-success} 0.1046154 1 0.1046154 1.438053 68
## [5] {address-U,
## failures-0,
## famsup-yes,
## romantic-no,
## school-GP,
## schoolsup-no,
## sex-F} => {final-success} 0.1000000 1 0.1000000 1.438053 65
## [6] {address-U,
## Dalc-1,
## failures-0,
## romantic-no,
## school-GP,
## schoolsup-no,
## sex-F} => {final-success} 0.1092308 1 0.1092308 1.438053 71
## [7] {address-U,
## failures-0,
## guardian-mother,
## romantic-no,
## school-GP,
## schoolsup-no,
## sex-F} => {final-success} 0.1015385 1 0.1015385 1.438053 66
## [8] {address-U,
## famsize-GT3,
## higher-yes,
## romantic-no,
## school-GP,
## schoolsup-no,
## sex-F} => {final-success} 0.1015385 1 0.1015385 1.438053 66
## [9] {address-U,
## failures-0,
## internet-yes,
## romantic-no,
## school-GP,
## schoolsup-no,
## sex-F} => {final-success} 0.1123077 1 0.1123077 1.438053 73
## [10] {address-U,
## failures-0,
## nursery-yes,
## romantic-no,
## school-GP,
## schoolsup-no,
## sex-F} => {final-success} 0.1061538 1 0.1061538 1.438053 69
## [11] {address-U,
## failures-0,
## Pstatus-T,
## romantic-no,
## school-GP,
## schoolsup-no,
## sex-F} => {final-success} 0.1200000 1 0.1200000 1.438053 78
## [12] {address-U,
## failures-0,
## higher-yes,
## romantic-no,
## school-GP,
## schoolsup-no,
## sex-F} => {final-success} 0.1307692 1 0.1307692 1.438053 85
## [13] {address-U,
## failures-0,
## paid-no,
## romantic-no,
## school-GP,
## schoolsup-no,
## sex-F} => {final-success} 0.1292308 1 0.1292308 1.438053 84
## [14] {address-U,
## higher-yes,
## Pstatus-T,
## romantic-no,
## school-GP,
## schoolsup-no,
## sex-F} => {final-success} 0.1230769 1 0.1230769 1.438053 80
## [15] {activities-no,
## failures-0,
## famsup-yes,
## higher-yes,
## paid-no,
## school-GP,
## schoolsup-no,
## sex-F} => {final-success} 0.1000000 1 0.1000000 1.438053 65
plot(head(rules.final,10), method="graph", engine="htmlwidget")
aff.items<-dissimilarity(trans1, which="items", method="affinity")
hc<-hclust(aff.items, method="ward.D2")
plot(hc, main="Dendrogram for Items")
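With 116,034 rules at these thresholds, many are redundant refinements of shorter rules that do not add confidence. An optional pruning step, sketched here with the is.redundant() helper from arules (this step is not part of the original analysis):
# keep only rules for which no simpler sub-rule has equal or higher confidence
nonredundant <- rules.final[!is.redundant(rules.final)]
length(nonredundant)
inspect(head(nonredundant, 5))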
To see the profile of students who failed the final exam, we run apriori again with a minimum support of 0.1 and a minimum confidence of 0.5, this time restricting the right-hand side to 'final-fail'. The rules are again sorted by decreasing lift; since final-fail occurs in only about 30.3% of the transactions, a confidence of 0.58 already corresponds to a lift of about 1.9.
rules<-apriori(data=trans1, parameter=list(supp=0.1,conf = 0.5),
appearance=list(default="lhs", rhs='final-fail'), control=list(verbose=F))
rules.clean<-rules
rules.final<-sort(rules.clean, by="lift", decreasing=TRUE)
summary(rules.final)
## set of 32 rules
##
## rule length distribution (lhs + rhs):sizes
## 3 4 5 6
## 6 14 9 3
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 4.000 4.000 4.281 5.000 6.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.1000 Min. :0.5000 Min. :0.1754 Min. :1.650
## 1st Qu.:0.1058 1st Qu.:0.5034 1st Qu.:0.1950 1st Qu.:1.661
## Median :0.1154 Median :0.5294 Median :0.2208 Median :1.747
## Mean :0.1172 Mean :0.5284 Mean :0.2229 Mean :1.743
## 3rd Qu.:0.1235 3rd Qu.:0.5456 3rd Qu.:0.2365 3rd Qu.:1.800
## Max. :0.1646 Max. :0.5789 Max. :0.3292 Max. :1.910
## count
## Min. : 65.00
## 1st Qu.: 68.75
## Median : 75.00
## Mean : 76.19
## 3rd Qu.: 80.25
## Max. :107.00
##
## mining info:
## data ntransactions support confidence
## trans1 650 0.1 0.5
## call
## apriori(data = trans1, parameter = list(supp = 0.1, conf = 0.5), appearance = list(default = "lhs", rhs = "final-fail"), control = list(verbose = F))
rules.final<-sort(rules.final, by="lift", decreasing=TRUE)
inspect(head(rules.final,10))
## lhs rhs support confidence coverage lift count
## [1] {reason-course,
## school-MS,
## schoolsup-no} => {final-fail} 0.1015385 0.5789474 0.1753846 1.910232 66
## [2] {reason-course,
## school-MS} => {final-fail} 0.1030769 0.5677966 0.1815385 1.873441 67
## [3] {guardian-mother,
## paid-no,
## Pstatus-T,
## school-MS} => {final-fail} 0.1046154 0.5528455 0.1892308 1.824110 68
## [4] {guardian-mother,
## Pstatus-T,
## school-MS} => {final-fail} 0.1107692 0.5496183 0.2015385 1.813461 72
## [5] {famsize-GT3,
## paid-no,
## school-MS,
## schoolsup-no} => {final-fail} 0.1138462 0.5481481 0.2076923 1.808611 74
## [6] {famsize-GT3,
## school-MS,
## schoolsup-no} => {final-fail} 0.1230769 0.5479452 0.2246154 1.807941 80
## [7] {guardian-mother,
## Pstatus-T,
## school-MS,
## schoolsup-no} => {final-fail} 0.1061538 0.5476190 0.1938462 1.806865 69
## [8] {guardian-mother,
## paid-no,
## Pstatus-T,
## school-MS,
## schoolsup-no} => {final-fail} 0.1000000 0.5462185 0.1830769 1.802244 65
## [9] {famsize-GT3,
## nursery-yes,
## school-MS} => {final-fail} 0.1015385 0.5454545 0.1861538 1.799723 66
## [10] {famsup-yes,
## school-MS,
## schoolsup-no} => {final-fail} 0.1000000 0.5416667 0.1846154 1.787225 65
plot(head(rules.final,10), method="graph", engine="htmlwidget")
aff.items<-dissimilarity(trans1, which="items", method="affinity")
hc<-hclust(aff.items, method="ward.D2")
plot(hc, main="Dendrogram for Items")
To do market basket analysis on automatically categorised data, we first convert every variable to a factor.
We then keep the 452 students who passed (final-success) and summarise them. As before, only items with an itemFrequency above 0.05 are kept, and class association rules are mined with a minimum support of 0.1 and a minimum confidence of 0.5. Because every transaction in this subset already contains final-success, all mined rules have confidence and lift equal to 1, so only support distinguishes them.
library(arulesCBA)
trans2=df
# convert every column to a factor so it is treated as categorical downstream
for (i in 1:29) {
trans2[,i]<-factor(trans2[,i],levels = unique(trans2[,i]))
}
trans21=trans2 %>% filter(final=='final-success')
data.disc=discretizeDF.supervised(data=trans21,final~.,methods = 'chi2')
summary(data.disc)
## school sex address famsize Pstatus
## school-GP:338 sex-F:281 address-U:335 famsize-GT3:317 Pstatus-A: 56
## school-MS:114 sex-M:171 address-R:117 famsize-LE3:135 Pstatus-T:396
##
##
##
## Medu Fedu Mjob Fjob
## Medu-4:151 Fedu-4:107 Mjob-at_home : 78 Fjob-teacher : 33
## Medu-1: 75 Fedu-1: 95 Mjob-health : 36 Fjob-other :261
## Medu-3: 94 Fedu-2:142 Mjob-other :176 Fjob-services:117
## Medu-2:127 Fedu-3:103 Mjob-services: 98 Fjob-health : 16
## Medu-0: 5 Fedu-0: 5 Mjob-teacher : 64 Fjob-at_home : 25
## reason guardian traveltime
## reason-course :176 guardian-mother:310 traveltime-2:137
## reason-other : 44 guardian-father:118 traveltime-1:273
## reason-home :116 guardian-other : 24 traveltime-3: 33
## reason-reputation:116 traveltime-4: 9
##
## studytime failures schoolsup famsup
## studytime-2:221 failures-0:433 schoolsup-yes: 45 famsup-no :169
## studytime-3: 82 failures-3: 1 schoolsup-no :407 famsup-yes:283
## studytime-1:121 failures-1: 15
## studytime-4: 28 failures-2: 3
##
## paid activities nursery higher
## paid-no :425 activities-no :224 nursery-yes:365 higher-yes:435
## paid-yes: 27 activities-yes:228 nursery-no : 87 higher-no : 17
##
##
##
## internet romantic famrel freetime
## internet-no : 90 romantic-no :300 famrel-4:234 freetime-3:184
## internet-yes:362 romantic-yes:152 famrel-5:120 freetime-2: 87
## famrel-3: 68 freetime-4:115
## famrel-1: 11 freetime-1: 29
## famrel-2: 19 freetime-5: 37
## goout Dalc Walc health final
## goout-4: 99 Dalc-1:334 Walc-1:183 health-3: 88 final-success:452
## goout-3:150 Dalc-2: 77 Walc-3: 83 health-5:165 final-fail : 0
## goout-2:110 Dalc-5: 8 Walc-2:110 health-1: 65
## goout-1: 29 Dalc-3: 22 Walc-4: 51 health-2: 55
## goout-5: 64 Dalc-4: 11 Walc-5: 25 health-4: 79
data.trans<-transactions(data.disc)
trans2<-data.trans[, itemFrequency(data.trans)>0.05]
data.ass<-mineCARs(final~ ., transactions=trans2, support=0.1, confidence=0.5)
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE FALSE 5 0.1 1
## maxlen target ext
## 5 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 45
##
## set item appearances ...[83 item(s)] done [0.00s].
## set transactions ...[83 item(s), 452 transaction(s)] done [0.00s].
## sorting and recoding items ... [83 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.22s].
## writing ... [36813 rule(s)] done [0.02s].
## creating S4 object ... done [0.02s].
data.ass=sort(data.ass,by='support',decreasing = TRUE)
summary(data.ass)
## set of 36813 rules
##
## rule length distribution (lhs + rhs):sizes
## 1 2 3 4 5
## 1 69 1171 7845 27727
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 5.000 5.000 4.718 5.000 5.000
##
## summary of quality measures:
## support confidence coverage lift count
## Min. :0.1018 Min. :1 Min. :0.1018 Min. :1 Min. : 46.00
## 1st Qu.:0.1173 1st Qu.:1 1st Qu.:0.1173 1st Qu.:1 1st Qu.: 53.00
## Median :0.1438 Median :1 Median :0.1438 Median :1 Median : 65.00
## Mean :0.1756 Mean :1 Mean :0.1756 Mean :1 Mean : 79.35
## 3rd Qu.:0.1969 3rd Qu.:1 3rd Qu.:0.1969 3rd Qu.:1 3rd Qu.: 89.00
## Max. :1.0000 Max. :1 Max. :1.0000 Max. :1 Max. :452.00
##
## mining info:
## data ntransactions support confidence
## transactions 452 0.1 0.5
## call
## apriori(data = transactions, parameter = parameter, appearance = list(rhs = vars$class_items, lhs = vars$feature_items), control = control)
inspect(head(data.ass,20))
## lhs rhs support confidence coverage lift count
## [1] {} => {final=final-success} 1.0000000 1 1.0000000 1 452
## [2] {higher=higher-yes} => {final=final-success} 0.9623894 1 0.9623894 1 435
## [3] {failures=failures-0} => {final=final-success} 0.9579646 1 0.9579646 1 433
## [4] {paid=paid-no} => {final=final-success} 0.9402655 1 0.9402655 1 425
## [5] {failures=failures-0,
## higher=higher-yes} => {final=final-success} 0.9269912 1 0.9269912 1 419
## [6] {failures=failures-0,
## paid=paid-no} => {final=final-success} 0.9048673 1 0.9048673 1 409
## [7] {paid=paid-no,
## higher=higher-yes} => {final=final-success} 0.9048673 1 0.9048673 1 409
## [8] {schoolsup=schoolsup-no} => {final=final-success} 0.9004425 1 0.9004425 1 407
## [9] {Pstatus=Pstatus-T} => {final=final-success} 0.8761062 1 0.8761062 1 396
## [10] {failures=failures-0,
## paid=paid-no,
## higher=higher-yes} => {final=final-success} 0.8761062 1 0.8761062 1 396
## [11] {failures=failures-0,
## schoolsup=schoolsup-no} => {final=final-success} 0.8650442 1 0.8650442 1 391
## [12] {schoolsup=schoolsup-no,
## higher=higher-yes} => {final=final-success} 0.8628319 1 0.8628319 1 390
## [13] {schoolsup=schoolsup-no,
## paid=paid-no} => {final=final-success} 0.8517699 1 0.8517699 1 385
## [14] {Pstatus=Pstatus-T,
## higher=higher-yes} => {final=final-success} 0.8429204 1 0.8429204 1 381
## [15] {Pstatus=Pstatus-T,
## failures=failures-0} => {final=final-success} 0.8407080 1 0.8407080 1 380
## [16] {failures=failures-0,
## schoolsup=schoolsup-no,
## higher=higher-yes} => {final=final-success} 0.8340708 1 0.8340708 1 377
## [17] {Pstatus=Pstatus-T,
## paid=paid-no} => {final=final-success} 0.8230088 1 0.8230088 1 372
## [18] {failures=failures-0,
## schoolsup=schoolsup-no,
## paid=paid-no} => {final=final-success} 0.8230088 1 0.8230088 1 372
## [19] {schoolsup=schoolsup-no,
## paid=paid-no,
## higher=higher-yes} => {final=final-success} 0.8163717 1 0.8163717 1 369
## [20] {Pstatus=Pstatus-T,
## failures=failures-0,
## higher=higher-yes} => {final=final-success} 0.8119469 1 0.8119469 1 367
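Since every rule in this subset has confidence and lift of 1, the trivial empty-antecedent rule {} => {final-success} tops the list and longer antecedents mostly restate item co-occurrence. A small sketch (an optional step, not in the original code) that drops the trivial rule and keeps antecedents of at most two items before ranking by support:
# drop the empty-lhs rule and keep short antecedents, then rank by support
keep <- size(lhs(data.ass)) > 0 & size(lhs(data.ass)) <= 2
short.rules <- sort(data.ass[keep], by = "support")
inspect(head(short.rules, 10))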
We repeat the same procedure for the 197 students who failed (final-fail). Again, only items with an itemFrequency above 0.05 are kept, and class association rules are mined with a minimum support of 0.1 and a minimum confidence of 0.5; as in the previous subset, confidence and lift are always 1, so the rules are effectively ranked by support.
library(arulesCBA)
trans2=df
for (i in 1:29) {
trans2[,i]<-factor(trans2[,i],levels = unique(trans2[,i]))
}
trans22=trans2 %>% filter(final=='final-fail')
data.disc=discretizeDF.supervised(data=trans22,final~.,methods = 'chi2')
summary(data.disc)
## school sex address famsize Pstatus
## school-GP: 85 sex-F:102 address-U:117 famsize-GT3:140 Pstatus-A: 24
## school-MS:112 sex-M: 95 address-R: 80 famsize-LE3: 57 Pstatus-T:173
##
##
##
## Medu Fedu Mjob Fjob
## Medu-4:24 Fedu-4:21 Mjob-at_home :57 Fjob-teacher : 3
## Medu-1:68 Fedu-1:79 Mjob-health :12 Fjob-other :106
## Medu-3:45 Fedu-2:67 Mjob-other :82 Fjob-services: 64
## Medu-2:59 Fedu-3:28 Mjob-services:38 Fjob-health : 7
## Medu-0: 1 Fedu-0: 2 Mjob-teacher : 8 Fjob-at_home : 17
## reason guardian traveltime
## reason-course :109 guardian-mother:145 traveltime-2:76
## reason-other : 28 guardian-father: 35 traveltime-1:93
## reason-home : 33 guardian-other : 17 traveltime-3:21
## reason-reputation: 27 traveltime-4: 7
##
## studytime failures schoolsup famsup
## studytime-2:84 failures-0:116 schoolsup-yes: 23 famsup-no : 82
## studytime-3:15 failures-3: 13 schoolsup-no :174 famsup-yes:115
## studytime-1:91 failures-1: 55
## studytime-4: 7 failures-2: 13
##
## paid activities nursery higher
## paid-no :185 activities-no :110 nursery-yes:156 higher-yes:145
## paid-yes: 12 activities-yes: 87 nursery-no : 41 higher-no : 52
##
##
##
## internet romantic famrel freetime
## internet-no : 61 romantic-no :110 famrel-4:83 freetime-3:67
## internet-yes:136 romantic-yes: 87 famrel-5:60 freetime-2:20
## famrel-3:33 freetime-4:63
## famrel-1:11 freetime-1:16
## famrel-2:10 freetime-5:31
## goout Dalc Walc health final
## goout-4:42 Dalc-1:117 Walc-1:64 health-3:36 final-success: 0
## goout-3:55 Dalc-2: 44 Walc-3:37 health-5:84 final-fail :197
## goout-2:35 Dalc-5: 9 Walc-2:40 health-1:25
## goout-1:19 Dalc-3: 21 Walc-4:36 health-2:23
## goout-5:46 Dalc-4: 6 Walc-5:20 health-4:29
data.trans<-transactions(data.disc)
trans2<-data.trans[, itemFrequency(data.trans)>0.05]
data.ass<-mineCARs(final~ ., transactions=trans2, support=0.1, confidence=0.5)
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE FALSE 5 0.1 1
## maxlen target ext
## 5 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 19
##
## set item appearances ...[87 item(s)] done [0.00s].
## set transactions ...[87 item(s), 197 transaction(s)] done [0.00s].
## sorting and recoding items ... [87 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.10s].
## writing ... [30188 rule(s)] done [0.00s].
## creating S4 object ... done [0.03s].
data.ass=sort(data.ass,by='support',decreasing = TRUE)
summary(data.ass)
## set of 30188 rules
##
## rule length distribution (lhs + rhs):sizes
## 1 2 3 4 5
## 1 75 1274 7822 21016
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 4.000 5.000 4.649 5.000 5.000
##
## summary of quality measures:
## support confidence coverage lift count
## Min. :0.1015 Min. :1 Min. :0.1015 Min. :1 Min. : 20.00
## 1st Qu.:0.1117 1st Qu.:1 1st Qu.:0.1117 1st Qu.:1 1st Qu.: 22.00
## Median :0.1320 Median :1 Median :0.1320 Median :1 Median : 26.00
## Mean :0.1529 Mean :1 Mean :0.1529 Mean :1 Mean : 30.12
## 3rd Qu.:0.1726 3rd Qu.:1 3rd Qu.:0.1726 3rd Qu.:1 3rd Qu.: 34.00
## Max. :1.0000 Max. :1 Max. :1.0000 Max. :1 Max. :197.00
##
## mining info:
## data ntransactions support confidence
## transactions 197 0.1 0.5
## call
## apriori(data = transactions, parameter = parameter, appearance = list(rhs = vars$class_items, lhs = vars$feature_items), control = control)
inspect(head(data.ass,20))
## lhs rhs support confidence coverage lift count
## [1] {} => {final=final-fail} 1.0000000 1 1.0000000 1 197
## [2] {paid=paid-no} => {final=final-fail} 0.9390863 1 0.9390863 1 185
## [3] {schoolsup=schoolsup-no} => {final=final-fail} 0.8832487 1 0.8832487 1 174
## [4] {Pstatus=Pstatus-T} => {final=final-fail} 0.8781726 1 0.8781726 1 173
## [5] {schoolsup=schoolsup-no,
## paid=paid-no} => {final=final-fail} 0.8274112 1 0.8274112 1 163
## [6] {Pstatus=Pstatus-T,
## paid=paid-no} => {final=final-fail} 0.8223350 1 0.8223350 1 162
## [7] {nursery=nursery-yes} => {final=final-fail} 0.7918782 1 0.7918782 1 156
## [8] {Pstatus=Pstatus-T,
## schoolsup=schoolsup-no} => {final=final-fail} 0.7664975 1 0.7664975 1 151
## [9] {paid=paid-no,
## nursery=nursery-yes} => {final=final-fail} 0.7411168 1 0.7411168 1 146
## [10] {higher=higher-yes} => {final=final-fail} 0.7360406 1 0.7360406 1 145
## [11] {guardian=guardian-mother} => {final=final-fail} 0.7360406 1 0.7360406 1 145
## [12] {Pstatus=Pstatus-T,
## schoolsup=schoolsup-no,
## paid=paid-no} => {final=final-fail} 0.7157360 1 0.7157360 1 141
## [13] {famsize=famsize-GT3} => {final=final-fail} 0.7106599 1 0.7106599 1 140
## [14] {Pstatus=Pstatus-T,
## nursery=nursery-yes} => {final=final-fail} 0.7055838 1 0.7055838 1 139
## [15] {schoolsup=schoolsup-no,
## nursery=nursery-yes} => {final=final-fail} 0.6954315 1 0.6954315 1 137
## [16] {internet=internet-yes} => {final=final-fail} 0.6903553 1 0.6903553 1 136
## [17] {guardian=guardian-mother,
## paid=paid-no} => {final=final-fail} 0.6903553 1 0.6903553 1 136
## [18] {paid=paid-no,
## higher=higher-yes} => {final=final-fail} 0.6852792 1 0.6852792 1 135
## [19] {famsize=famsize-GT3,
## paid=paid-no} => {final=final-fail} 0.6598985 1 0.6598985 1 130
## [20] {Pstatus=Pstatus-T,
## guardian=guardian-mother} => {final=final-fail} 0.6598985 1 0.6598985 1 130
According to the highest-lift rules above, students who attend the GP school, live in an urban area, are female, have no past class failures, receive no extra school support, and are not in a romantic relationship were the most likely to pass the final exam.
However, students who attend the MS school, chose the school for its courses, have their mother as guardian, have parents living together, and come from larger families (famsize greater than 3) were the most likely to fail the final exam.
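As a quick check of the failure profile (a sketch, not part of the original code), the fail rules mined earlier on the manually categorised data can be narrowed to those whose antecedent contains both school-MS and reason-course:
# rules.final still holds the 'final-fail' rules from the earlier apriori run;
# %ain% keeps rules whose lhs contains all of the listed items
ms.course <- subset(rules.final, subset = lhs %ain% c("school-MS", "reason-course"))
inspect(head(sort(ms.course, by = "lift"), 3))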