1 Introduction

1.1 Market Basket Analysis

Market Basket Analysis (MBA) is a data mining technique that retailers use to increase sales by better understanding customer purchasing patterns. In machine learning terms, MBA is an unsupervised learning technique for analyzing transactional data, most often the purchasing patterns of customers. For example:

{T-shirt,Trousers}⇒{Jacket}

The rule above can be read as: if someone buys a T-shirt and Trousers, then a Jacket is also likely to be purchased. From this example, MBA is clearly an important analysis technique in retail and sales, but MBA, also known as Association Rules Mining, can also be a powerful tool in many other scenarios.

In this example I use MBA to find which student characteristics are associated with alcohol consumption, using the “Student Alcohol Consumption” dataset from Kaggle.

1.2 Apriori Algorithm

When we talk about Market Basket Analysis or Association Rules Mining, one algorithm comes to mind first: the Apriori algorithm.

According to Wikipedia:

The frequent item sets determined by Apriori can be used to determine association rules which highlight general trends in the database: this has applications in domains such as market basket analysis.

For more information about the Apriori algorithm, you can click here or here.

In the table above, we can see four transactions from a supermarket. The item set is

\[ I = \{\text{Tooth brush}, \text{Tooth paste}, \text{Mouth wash}, \text{Jam}, \text{Peanut butter}, \text{Bread}, \text{Cereal}, \text{Milk}, \text{T-shirt}, \text{Trousers}\} \]

and the transaction set is

\[ T = \{T_1, T_2, T_3, T_4\} \]

For example,

\[ T_1 = \{\text{Tooth brush}, \text{Tooth paste}, \text{Mouth wash}\}. \]

An association rule is then defined as:

\[ X \Rightarrow Y, \quad \text{where } X \subset I,\ Y \subset I \text{ and } X \cap Y = \emptyset \]

From transaction 1 (T1), one such rule is

\[ \{\text{Tooth brush}, \text{Tooth paste}\} \Rightarrow \{\text{Mouth wash}\} \]
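To make the idea behind Apriori a bit more concrete, here is a minimal sketch in base R. Only T1 comes from the example above; T2 to T4, the `min_support` value and the helper name `support()` are made up for illustration, and this is not how the arules package implements the algorithm internally.

```r
# Toy transactions: T1 is from the article, T2-T4 are hypothetical.
transactions <- list(
  T1 = c("Tooth brush", "Tooth paste", "Mouth wash"),
  T2 = c("Tooth brush", "Tooth paste", "Bread"),
  T3 = c("Bread", "Jam", "Peanut butter"),
  T4 = c("Bread", "Jam", "Milk")
)
min_support <- 0.5  # hypothetical threshold

# Share of transactions that contain every item of `itemset`.
support <- function(itemset, transactions) {
  mean(vapply(transactions, function(t) all(itemset %in% t), logical(1)))
}

# Level 1: keep the single items that reach the minimum support.
items <- sort(unique(unlist(transactions)))
item_support <- vapply(items, support, numeric(1), transactions = transactions)
frequent_1 <- items[item_support >= min_support]

# Level 2: build candidate pairs only from frequent single items
# (the Apriori pruning idea), then keep the pairs that are frequent.
candidates_2 <- combn(frequent_1, 2, simplify = FALSE)
pair_support <- vapply(candidates_2, support, numeric(1), transactions = transactions)
frequent_2 <- candidates_2[pair_support >= min_support]

print(frequent_2)
```

With these toy transactions, the pairs that survive are {Bread, Jam} and {Tooth brush, Tooth paste}. The key point is that pairs are built only from items that are already frequent on their own, which keeps the search small.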

2 Library Packages

library(arules) #For Mining Association Rules
library(arulesViz) # For Visualizing Association Rules
library(tidyr) # For Tidying the Data
library(tidyverse) #For Data Manipulation and Visualization (Consist of Multiple R Package)

In this project we use four libraries:

arules : Used for mining association rules
arulesViz : Used for visualizing association rules
tidyr : Used for tidying the data
tidyverse : Used for data manipulation and visualization (consists of multiple R packages)

3 Data

data <- read.csv("student-por.csv")
head(data)

str(data)

## 'data.frame':    649 obs. of  33 variables:
##  $ school    : chr  "GP" "GP" "GP" "GP" ...
##  $ sex       : chr  "F" "F" "F" "F" ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : chr  "U" "U" "U" "U" ...
##  $ famsize   : chr  "GT3" "GT3" "LE3" "GT3" ...
##  $ Pstatus   : chr  "A" "T" "T" "T" ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : chr  "at_home" "at_home" "at_home" "health" ...
##  $ Fjob      : chr  "teacher" "other" "other" "services" ...
##  $ reason    : chr  "course" "course" "other" "home" ...
##  $ guardian  : chr  "mother" "father" "mother" "mother" ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ schoolsup : chr  "yes" "no" "yes" "no" ...
##  $ famsup    : chr  "no" "yes" "no" "yes" ...
##  $ paid      : chr  "no" "no" "no" "no" ...
##  $ activities: chr  "no" "no" "no" "yes" ...
##  $ nursery   : chr  "yes" "no" "yes" "yes" ...
##  $ higher    : chr  "yes" "yes" "yes" "yes" ...
##  $ internet  : chr  "no" "yes" "yes" "yes" ...
##  $ romantic  : chr  "no" "no" "no" "yes" ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  4 2 6 0 0 6 0 2 0 0 ...
##  $ G1        : int  0 9 12 14 11 12 13 10 15 12 ...
##  $ G2        : int  11 11 13 14 13 12 12 13 16 12 ...
##  $ G3        : int  11 11 12 14 13 13 13 13 17 13 ...

The student alcohol consumption dataset originally consists of 649 observations and 33 variables. To use this data with market basket analysis, every variable must be converted to a factor. In the next section, I convert all variables to factors and also merge some of them.

4 Feature Engineering and Data preparation

If you wonder what feature engineering is, the simplest definition is: the process of using domain knowledge to extract features from raw data via data mining techniques. These features can be used to improve the performance of machine learning algorithms.

For more information you can click here
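As a small illustration of the kind of transformation used in this section, the sketch below turns two numeric scores into one two-level factor. The vectors `dalc` and `walc` hold made-up values, not rows from the dataset; the 2.5 cutoff is the one used for alc_cons in the code that follows.

```r
# Hypothetical workday and weekend alcohol scores on a 1-5 scale.
dalc <- c(1, 2, 5, 3)
walc <- c(1, 3, 4, 2)

# Average the two scores, then split them at the 2.5 cutoff.
alc_mean <- (dalc + walc) / 2
alc_cons <- factor(ifelse(alc_mean >= 2.5, "High", "Low"))

print(data.frame(dalc, walc, alc_mean, alc_cons))
```

Note that a student whose average is exactly 2.5 lands in the "High" group, because the comparison is `>=`.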

# Alcohol consumption: average of workday (Dalc) and weekend (Walc) scores
data$alc_cons <- (data$Dalc + data$Walc) / 2
data$alc_cons <- ifelse(data$alc_cons >= 2.5, "High", "Low")

# Parents' education: average of father's (Fedu) and mother's (Medu) levels
data$parents_ed <- (data$Fedu + data$Medu) / 2
data$parents_ed <- ifelse(data$parents_ed > 2, "High Education", "Low Education")

# Grades: improvement from G1 to G3, and average of the three grades
data$grade_imp <- ifelse(data$G1 < data$G3, "Improve", "Not Improve")
data$grade_ave <- (data$G1 + data$G2 + data$G3) / 3
data$grade <- ifelse(data$grade_ave >= 12, "Above Average", "Below Average")

# Recode the remaining variables into two levels each.
# personality and like_school use the numeric freetime and failures,
# so they are computed before those two columns are recoded.
data$age <- ifelse(data$age >= 19, "19-22", "15-18")
data$personality <- ifelse(data$freetime >= 3 & data$goout >= 3, "Extrovert", "Introvert")
data$famsize <- ifelse(data$famsize == "GT3", "Big", "Small")
data$like_school <- ifelse(data$absences >= 3 & data$failures > 2, "Yes", "No")
data$ed_support <- ifelse(data$famsup == "yes" | data$schoolsup == "yes", "Yes", "No")
data$failures <- ifelse(data$failures == 0, "No", "Yes")
data$traveltime <- ifelse(data$traveltime > 2, "Long", "Short")
data$famrel <- ifelse(data$famrel >= 3, "Good", "Bad")
data$health <- ifelse(data$health >= 3, "Good", "Bad")
data$address <- ifelse(data$address == "U", "Urban", "Rural")
data$parents_guidance <- ifelse(data$Mjob == "at_home" | data$Fjob == "at_home", "Yes", "No")
data$Pstatus <- ifelse(data$Pstatus == "A", "Apart", "Together")
data$studytime <- ifelse(data$studytime >= 3, "Long", "Short")
data$freetime <- ifelse(data$freetime >= 3, "Many", "Few")

# Drop the raw columns that were merged or are no longer needed
data <- data %>%
  select(-c(goout, absences, reason, Dalc, Walc, Fjob, Mjob, guardian, G1, G2, G3, grade_ave, schoolsup, famsup, Medu, Fedu))

# Convert character columns to factors
data <- data %>%
  mutate_if(is.character, as.factor)

# Keep only the factor columns
data <- data %>%
  select_if(is.factor)

str(data)

## 'data.frame':    649 obs. of  26 variables:
##  $ school          : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex             : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
##  $ age             : Factor w/ 2 levels "15-18","19-22": 1 1 1 1 1 1 1 1 1 1 ...
##  $ address         : Factor w/ 2 levels "Rural","Urban": 2 2 2 2 2 2 2 2 2 2 ...
##  $ famsize         : Factor w/ 2 levels "Big","Small": 1 1 2 1 1 2 2 1 2 1 ...
##  $ Pstatus         : Factor w/ 2 levels "Apart","Together": 1 2 2 2 2 2 2 1 1 2 ...
##  $ traveltime      : Factor w/ 2 levels "Long","Short": 2 2 2 2 2 2 2 2 2 2 ...
##  $ studytime       : Factor w/ 2 levels "Long","Short": 2 2 2 1 2 2 2 2 2 2 ...
##  $ failures        : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ paid            : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ activities      : Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
##  $ nursery         : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
##  $ higher          : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ internet        : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
##  $ romantic        : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ famrel          : Factor w/ 2 levels "Bad","Good": 2 2 2 2 2 2 2 2 2 2 ...
##  $ freetime        : Factor w/ 2 levels "Few","Many": 2 2 2 1 2 2 2 1 1 2 ...
##  $ health          : Factor w/ 2 levels "Bad","Good": 2 2 2 2 2 2 2 1 1 2 ...
##  $ alc_cons        : Factor w/ 2 levels "High","Low": 2 2 1 2 2 2 2 2 2 2 ...
##  $ parents_ed      : Factor w/ 2 levels "High Education",..: 1 2 2 1 1 1 2 1 1 1 ...
##  $ grade_imp       : Factor w/ 2 levels "Improve","Not Improve": 1 1 2 2 1 1 2 1 1 1 ...
##  $ grade           : Factor w/ 2 levels "Above Average",..: 2 2 1 1 1 1 1 1 1 1 ...
##  $ personality     : Factor w/ 2 levels "Extrovert","Introvert": 1 1 2 2 2 2 1 2 2 2 ...
##  $ like_school     : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ ed_support      : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 1 2 2 2 ...
##  $ parents_guidance: Factor w/ 2 levels "No","Yes": 2 2 2 1 1 1 1 1 1 1 ...

After the feature engineering steps, in which I created new features that can improve the model, removed some unnecessary variables and, most importantly, converted all variables to factors, we get the “clean” data used for the association rules / market basket analysis: 649 observations and 26 variables.

Here are the definitions of the variables:

school : The school the student attends (MS : Mousinho da Silveira, GP : Gabriel Pereira)
sex : The gender of the student (M : Male, F : Female)
age : Age of the student (15-18 or 19-22)
address : The student's living area (Rural, Urban)
famsize : The student's family size (Big : more than 3 people, Small : 3 people or fewer)
Pstatus : Parents' cohabitation status (Together, Apart)
traveltime : The student's travel time to school (Long : 30 minutes or longer, Short : below 30 minutes)
studytime : Time the student spends studying (Long, Short)
failures : Whether the student has ever failed a class (Yes, No)
paid : Whether the student paid for extra classes in Math or Portuguese (yes, no)
activities : Whether the student takes part in extra-curricular activities (yes, no)
nursery : Whether the student attended nursery school (yes, no)
higher : Whether the student wants to take higher education (yes, no)
internet : Whether the student has internet access at home (yes, no)
romantic : Whether the student is in a romantic relationship (yes, no)
famrel : The quality of the student's family relationships (Good, Bad)
freetime : The student's amount of free time (Many, Few)
health : The student's health condition (Good, Bad)
alc_cons : The student's alcohol consumption level (High, Low)
parents_ed : The education level of the student's parents (High Education, Low Education)
grade_imp : Whether the student's grade improved, that is G1 < G3 (Improve, Not Improve)
grade : Whether the average of the student's three grades is above the overall average, which the code takes to be 12 (Above Average, Below Average)
personality : Personality of the student (Extrovert, Introvert), based on the free time and going out scores
like_school : Whether the respondent likes school (Yes, No), based on absences and failures
parents_guidance : Whether either the father or the mother is at home, that is, their job is recorded as at_home (Yes, No)
ed_support : Whether the student has educational support from either family or school (Yes, No)

If you wonder where the “Transactions” and the “Items” are, since no variable has those names: don’t worry. In this dataset each student (row) is one transaction, and each variable=value pair, such as sex=M, is an item. “alc_cons” is the variable we place on the right-hand side of the rules, and the rest of the variables supply the items for the left-hand side.
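To see how a data frame of factors maps onto transactions and items, here is a minimal base R sketch. The data frame `toy` and its three rows are made up; the `variable=level` labels mirror the ones that appear in the rules in the next section. As far as I know, arules does this kind of conversion itself when `apriori()` receives a data frame, so the code is only for illustration.

```r
# Hypothetical three-student data frame with factor columns.
toy <- data.frame(
  sex      = factor(c("F", "M", "M")),
  higher   = factor(c("yes", "no", "yes")),
  alc_cons = factor(c("Low", "High", "Low"))
)

# One item per variable=level pair.
items <- unlist(lapply(names(toy), function(v) paste0(v, "=", levels(toy[[v]]))))

# One transaction per row: the items that describe that student.
transactions <- lapply(seq_len(nrow(toy)), function(i) {
  paste0(names(toy), "=", vapply(toy[i, ], as.character, character(1)))
})

print(items)
print(transactions[[2]])
```

Here three two-level variables give six items. With the 26 two-level variables of the real data, the same counting gives 52 items, which matches the `52 item(s), 649 transaction(s)` line in the apriori output.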

5 The Modelling Process

In this part we build the model from the data. We divide it into two parts: searching for the factors associated with high alcohol consumption, and searching for the factors associated with low alcohol consumption.

Before we go to the modelling process, it is better to know these terms first:

1. Support

Support is the percentage of transactions that contain all of the items in an itemset, for example {Item A, Item B}. The higher the support, the more frequently the itemset occurs. Rules with a high support are preferred since they are likely to be applicable to a large number of future transactions.

Support is calculated as:

\[ Support(ItemA \Rightarrow ItemB) = Pr(ItemA, ItemB) = \dfrac{count(ItemA, ItemB)}{N} \]

where N represents the total number of transactions.
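As a quick illustration of the support formula, the sketch below counts how many rows of a small data frame contain all items of an itemset. The data frame `toy` and its five rows are made up; only the column names are borrowed from the dataset.

```r
# Hypothetical five-student data frame.
toy <- data.frame(
  sex      = c("M", "M", "F", "M", "F"),
  higher   = c("no", "yes", "yes", "no", "yes"),
  alc_cons = c("High", "Low", "Low", "High", "Low")
)

# TRUE for every row that contains all three items of the itemset
# {sex=M, higher=no, alc_cons=High}.
in_itemset <- toy$sex == "M" & toy$higher == "no" & toy$alc_cons == "High"

support_count <- sum(in_itemset)           # number of matching transactions
support <- support_count / nrow(toy)       # count / N

print(c(count = support_count, support = support))
```

Two of the five rows match, so the support of this itemset in the toy data is 2 / 5 = 0.4.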

2. Confidence

Confidence is the probability that a transaction containing the items on the left-hand side of the rule also contains the item on the right-hand side. The higher the confidence, the greater the chance that the item on the right-hand side will be purchased.

Confidence is calculated as:

\[ Confidence(ItemA \Rightarrow ItemB) = \dfrac{support(ItemA, ItemB)}{support(ItemA)} \]
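As a worked example, the sketch below recomputes the confidence of the rule {sex=M,higher=no} => {alc_cons=High} that is reported in the high alcohol consumption results later in this article. The count 23 and the total 649 are printed there; the left-hand-side count of 34 is not printed, I derive it from the reported coverage (0.05238829 × 649), so treat it as an assumption.

```r
n_total   <- 649  # transactions in the dataset
n_lhs_rhs <- 23   # students matching both sides of the rule ("count" column)
n_lhs     <- 34   # students matching the left-hand side (derived from coverage)

support_rule <- n_lhs_rhs / n_total  # support(ItemA, ItemB)
support_lhs  <- n_lhs / n_total      # support(ItemA), called coverage by arules
confidence   <- support_rule / support_lhs

print(round(c(support = support_rule, coverage = support_lhs, confidence = confidence), 4))
```

The result agrees with the reported confidence of 0.6764706: the total N cancels out, so confidence is simply 23 / 34.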

3. Lift

Lift is the support of the rule divided by the product of the probabilities of the left- and right-hand sides, that is, the support we would expect if there were no association between them.

Lift is calculated as:

\[ Lift(A \Rightarrow B) = \dfrac{support(A,B)}{Pr(A)Pr(B)} = \dfrac{Pr(A,B)}{Pr(A)Pr(B)} = \dfrac{Pr(B \mid A)}{Pr(B)} \]

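These are the implications of lift:

When lift is equal to 1, the left- and right-hand sides are independent: there is no relationship at all.
When lift is more than 1, the right-hand side is more likely to occur when the left-hand side is present.
When lift is lower than 1 (it can never go below 0), the right-hand side is less likely to occur when the left-hand side is present.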
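Since lift is a ratio of probabilities, the value to compare it against is 1. The sketch below computes lift as Pr(B|A) / Pr(B) and labels the result. The helper names `lift()` and `interpret_lift()` are my own, and the count of 193 students with alc_cons=High is not printed anywhere in the output; I derive it from the confidence and lift reported for the first high-consumption rule, so treat it as an assumption.

```r
# Lift as confidence divided by the support of the right-hand side.
lift <- function(confidence, rhs_support) {
  confidence / rhs_support
}

# Label a lift value relative to 1 (with a small tolerance for rounding).
interpret_lift <- function(x, tol = 1e-8) {
  if (abs(x - 1) < tol) {
    "independent"
  } else if (x > 1) {
    "positive association"
  } else {
    "negative association"
  }
}

n_total <- 649  # transactions in the dataset
n_high  <- 193  # students with alc_cons=High (derived, see above)

# Rule {sex=M,higher=no} => {alc_cons=High}: confidence is 23 / 34.
lift_high <- lift(23 / 34, n_high / n_total)

print(round(lift_high, 4))
print(interpret_lift(lift_high))
```

The computed value agrees with the lift of 2.274764 reported for that rule, and `interpret_lift()` labels it as a positive association.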

Now let’s do the modelling and see what we can take away from this dataset.

5.2.1 High Alcohol Consumption

mba_high <- apriori(data, parameter = list(sup = 0.01, conf = 0.5, target="rules",minlen=2,maxlen=3), appearance = list(rhs= "alc_cons=High", default = "lhs"))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5    0.01      2
##  maxlen target  ext
##       3  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 6 
## 
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[52 item(s), 649 transaction(s)] done [0.00s].
## sorting and recoding items ... [52 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [27 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

inspect(head(sort(mba_high, by="confidence"),10))

##      lhs                                   rhs             support   
## [1]  {sex=M,higher=no}                  => {alc_cons=High} 0.03543914
## [2]  {sex=M,famrel=Bad}                 => {alc_cons=High} 0.01694915
## [3]  {famrel=Bad,ed_support=No}         => {alc_cons=High} 0.01848998
## [4]  {sex=M,traveltime=Long}            => {alc_cons=High} 0.03235747
## [5]  {sex=M,failures=Yes}               => {alc_cons=High} 0.04314330
## [6]  {sex=M,grade=Below Average}        => {alc_cons=High} 0.14791988
## [7]  {paid=yes,personality=Extrovert}   => {alc_cons=High} 0.01694915
## [8]  {famrel=Bad,personality=Extrovert} => {alc_cons=High} 0.02311248
## [9]  {famsize=Small,traveltime=Long}    => {alc_cons=High} 0.01848998
## [10] {age=19-22,higher=no}              => {alc_cons=High} 0.01386749
##      confidence coverage   lift     count
## [1]  0.6764706  0.05238829 2.274764 23   
## [2]  0.6470588  0.02619414 2.175861 11   
## [3]  0.6315789  0.02927581 2.123807 12   
## [4]  0.6176471  0.05238829 2.076958 21   
## [5]  0.6086957  0.07087827 2.046857 28   
## [6]  0.5889571  0.25115562 1.980483 96   
## [7]  0.5789474  0.02927581 1.946823 11   
## [8]  0.5769231  0.04006163 1.940016 15   
## [9]  0.5714286  0.03235747 1.921540 12   
## [10] 0.5625000  0.02465331 1.891516  9

From the first rule we can infer that male students who do not want to take higher education and who have a high consumption of alcohol make up 3.54% (support) of the whole dataset. Among male students who do not want to take higher education, 67.65% (confidence) have a high consumption of alcohol. In other words, a male student who does not want to take higher education is 2.27 (lift) times as likely to have a high consumption of alcohol as a student picked at random from the dataset.

I also visualize the result above:

plot(mba_high)

plot(mba_high[1:10], method = "graph")

plot(mba_high[1:10], method="graph", control=list(layout=igraph::in_circle()))

5.2.2 Low Alcohol Consumption

mba_low <- apriori(data, parameter = list(sup = 0.5, conf = 0.7, target="rules",minlen=2,maxlen=3), appearance = list(rhs= "alc_cons=Low", default = "lhs"))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.7    0.1    1 none FALSE            TRUE       5     0.5      2
##  maxlen target  ext
##       3  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 324 
## 
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[52 item(s), 649 transaction(s)] done [0.00s].
## sorting and recoding items ... [26 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3

## Warning in apriori(data, parameter = list(sup = 0.5, conf = 0.7, target =
## "rules", : Mining stopped (maxlen reached). Only patterns up to a length of 3
## returned!

##  done [0.00s].
## writing ... [49 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

summary(mba_low)

## set of 49 rules
## 
## rule length distribution (lhs + rhs):sizes
##  2  3 
## 10 39 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   3.000   3.000   2.796   3.000   3.000 
## 
## summary of quality measures:
##     support         confidence        coverage           lift       
##  Min.   :0.5008   Min.   :0.7020   Min.   :0.6857   Min.   :0.9992  
##  1st Qu.:0.5424   1st Qu.:0.7118   1st Qu.:0.7519   1st Qu.:1.0130  
##  Median :0.5794   Median :0.7180   Median :0.8213   Median :1.0219  
##  Mean   :0.5854   Mean   :0.7178   Mean   :0.8158   Mean   :1.0216  
##  3rd Qu.:0.6240   3rd Qu.:0.7231   3rd Qu.:0.8675   3rd Qu.:1.0292  
##  Max.   :0.6965   Max.   :0.7393   Max.   :0.9877   Max.   :1.0522  
##      count      
##  Min.   :325.0  
##  1st Qu.:352.0  
##  Median :376.0  
##  Mean   :379.9  
##  3rd Qu.:405.0  
##  Max.   :452.0  
## 
## mining info:
##  data ntransactions support confidence
##  data           649     0.5        0.7

plot(mba_low)

inspect(head(sort(mba_low, by="confidence"),10))

##      lhs                               rhs            support   confidence
## [1]  {failures=No,nursery=yes}      => {alc_cons=Low} 0.5069337 0.7393258 
## [2]  {traveltime=Short,failures=No} => {alc_cons=Low} 0.5624037 0.7344064 
## [3]  {failures=No,famrel=Good}      => {alc_cons=Low} 0.5778120 0.7338552 
## [4]  {nursery=yes,higher=yes}       => {alc_cons=Low} 0.5300462 0.7334755 
## [5]  {failures=No,higher=yes}       => {alc_cons=Low} 0.5762712 0.7290448 
## [6]  {age=15-18,nursery=yes}        => {alc_cons=Low} 0.5454545 0.7283951 
## [7]  {higher=yes,famrel=Good}       => {alc_cons=Low} 0.6024653 0.7281192 
## [8]  {traveltime=Short,nursery=yes} => {alc_cons=Low} 0.5208012 0.7253219 
## [9]  {failures=No,paid=no}          => {alc_cons=Low} 0.5808937 0.7250000 
## [10] {nursery=yes,famrel=Good}      => {alc_cons=Low} 0.5346687 0.7244259 
##      coverage  lift     count
## [1]  0.6856703 1.052242 329  
## [2]  0.7657935 1.045241 365  
## [3]  0.7873652 1.044456 375  
## [4]  0.7226502 1.043916 344  
## [5]  0.7904468 1.037610 374  
## [6]  0.7488444 1.036685 354  
## [7]  0.8274268 1.036292 391  
## [8]  0.7180277 1.032311 338  
## [9]  0.8012327 1.031853 377  
## [10] 0.7380586 1.031036 347

From the first rule we can infer that students who have never failed a class, attended nursery school and have a low consumption of alcohol make up 50.69% (support) of the whole dataset. Among students who have never failed a class and attended nursery school, 73.93% (confidence) have a low consumption of alcohol. Such a student is 1.05 (lift) times as likely to have a low consumption of alcohol as a student picked at random. Because this lift is only slightly above 1, the association is weak.

The plots below visualize the low-consumption rules:

plot(mba_low[1:10], method="graph")

plot(mba_low[1:10], method="grouped")

plot(head(sort(mba_low,by="lift"),10),method="graph")

6 Conclusion

Market basket analysis is a very useful technique for analyzing data. Traditionally it is used only for transaction data, but it is not limited to that. You can apply this technique to many types of datasets, as long as you convert the variables to factors first. Hopefully this helps you do your own MBA analysis.

Thank you :)

Market Basket Analysis (Association Rules Mining)

Gerald Bryan

11/11/2020